
A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data

BACKGROUND: Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of different approaches have been proposed for that purpose, out of which different mixture models provide a principled probabilistic fr...


Bibliographic Details
Main authors: Dai, Xiaofeng; Erkkilä, Timo; Yli-Harja, Olli; Lähdesmäki, Harri
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2717092/
https://www.ncbi.nlm.nih.gov/pubmed/19480678
http://dx.doi.org/10.1186/1471-2105-10-165
_version_ 1782169863237664768
author Dai, Xiaofeng
Erkkilä, Timo
Yli-Harja, Olli
Lähdesmäki, Harri
author_facet Dai, Xiaofeng
Erkkilä, Timo
Yli-Harja, Olli
Lähdesmäki, Harri
author_sort Dai, Xiaofeng
collection PubMed
description BACKGROUND: Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of different approaches have been proposed for that purpose, out of which different mixture models provide a principled probabilistic framework. Cluster analysis is increasingly often supplemented with multiple data sources nowadays, and these heterogeneous information sources should be made as efficient use of as possible. RESULTS: This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). The proposed BGMM method differs from other mixture model based methods in its integration of two different data types into a single and unified probabilistic modeling framework, which provides a more efficient use of multiple data sources than methods that analyze different data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework since many data sources can be modeled as Gaussian or beta distributed random variables, and it can also be extended to integrate data that have other parametric distributions as well, which adds even more flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM, the standard expectation maximization (EM) algorithm, an approximated EM and a hybrid EM, and propose to tackle the model selection problem by well-known model selection criteria, for which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC). 
CONCLUSION: Performance tests with simulated data show that combining two different data sources into a single mixture joint model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distribution) and protein-DNA binding probabilities (modeled as beta distribution) also demonstrate that BGMM can yield more biologically reasonable results compared with either of its two extreme cases. One of our applications has found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might be useful to better understand the TLR-3/4 signal transduction.
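The joint model summarized in the description can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that, within each cluster, every Gaussian feature and every beta feature is independent (a diagonal-covariance simplification), and it uses the common textbook forms of AIC, AIC3, BIC and ICL-BIC, which may differ in detail from the definitions in the paper. The function names, parameter layout and example numbers are all hypothetical. The code shows the E-step quantity for one gene (posterior cluster probabilities) and how the four criteria are computed from a fitted log-likelihood.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def beta_logpdf(y, a, b):
    """Log density of a Beta(a, b) variable; requires 0 < y < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)


def responsibilities(x, y, weights, gauss_params, beta_params):
    """Posterior cluster probabilities for one gene (the E-step quantity).

    x: Gaussian-modeled features; y: beta-modeled features in (0, 1).
    gauss_params[k] is a list of (mean, variance) pairs for cluster k;
    beta_params[k] is a list of (a, b) pairs for cluster k.
    Assumption: all features are independent given the cluster.
    """
    log_joint = []
    for w, gp, bp in zip(weights, gauss_params, beta_params):
        lp = math.log(w)
        lp += sum(gaussian_logpdf(xi, m, v) for xi, (m, v) in zip(x, gp))
        lp += sum(beta_logpdf(yi, a, b) for yi, (a, b) in zip(y, bp))
        log_joint.append(lp)
    # Log-sum-exp keeps the normalization numerically stable.
    top = max(log_joint)
    log_lik = top + math.log(sum(math.exp(lp - top) for lp in log_joint))
    post = [math.exp(lp - log_lik) for lp in log_joint]
    return post, log_lik


def information_criteria(log_lik, n_params, n_obs, posteriors):
    """Textbook forms of the four criteria; smaller values are preferred."""
    aic = -2.0 * log_lik + 2.0 * n_params
    aic3 = -2.0 * log_lik + 3.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    # Entropy of the soft assignments penalizes poorly separated clusters.
    entropy = -sum(p * math.log(p) for row in posteriors for p in row if p > 0.0)
    icl_bic = bic + 2.0 * entropy
    return {"AIC": aic, "AIC3": aic3, "BIC": bic, "ICL-BIC": icl_bic}


# Hypothetical two-cluster model with one Gaussian and one beta feature.
weights = [0.5, 0.5]
gauss_params = [[(0.0, 1.0)], [(3.0, 1.0)]]
beta_params = [[(2.0, 5.0)], [(5.0, 2.0)]]

post, gene_log_lik = responsibilities([0.1], [0.2], weights, gauss_params, beta_params)
print(post)

# Made-up fitted log-likelihood and soft assignments, for illustration only.
scores = information_criteria(-100.0, 5, 50, [[0.5, 0.5], [1.0, 0.0]])
print(scores)
```

In a full EM loop the responsibilities would be recomputed for every gene and followed by an M-step that updates the mixing weights and the Gaussian and beta parameters; beta parameters have no closed-form update, which is presumably why the record mentions approximated and hybrid EM variants alongside the standard one.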
format Text
id pubmed-2717092
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2717092 2009-07-29 A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data Dai, Xiaofeng Erkkilä, Timo Yli-Harja, Olli Lähdesmäki, Harri BMC Bioinformatics Methodology Article BioMed Central 2009-05-29 /pmc/articles/PMC2717092/ /pubmed/19480678 http://dx.doi.org/10.1186/1471-2105-10-165 Text en Copyright © 2009 Dai et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2717092/
https://www.ncbi.nlm.nih.gov/pubmed/19480678
http://dx.doi.org/10.1186/1471-2105-10-165