
Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics

The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approxim...


Bibliographic Details
Main authors: Crook, Oliver M., Gatto, Laurent, Kirk, Paul D.W.
Format: Online Article Text
Language: English
Published: 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7614016/
https://www.ncbi.nlm.nih.gov/pubmed/31829970
http://dx.doi.org/10.1515/sagmb-2018-0065
author Crook, Oliver M.
Gatto, Laurent
Kirk, Paul D.W.
collection PubMed
description The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel
format Online
Article
Text
id pubmed-7614016
institution National Center for Biotechnology Information
language English
publishDate 2019
record_format MEDLINE/PubMed
spelling pubmed-7614016 2023-01-03 Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics Crook, Oliver M. Gatto, Laurent Kirk, Paul D.W. Stat Appl Genet Mol Biol Article 2019-12-12 /pmc/articles/PMC7614016/ /pubmed/31829970 http://dx.doi.org/10.1515/sagmb-2018-0065 Text en This work is licensed under the Creative Commons Attribution 4.0 Public License https://creativecommons.org/licenses/by/4.0/.
title Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7614016/
https://www.ncbi.nlm.nih.gov/pubmed/31829970
http://dx.doi.org/10.1515/sagmb-2018-0065