
caBIG™ VISDA: Modeling, visualization, and discovery for cluster analysis of genomic data

BACKGROUND: The main limitations of most existing clustering methods used in genomic data analysis include heuristic or random algorithm initialization, the potential of finding poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and...


Bibliographic Details
Main Authors:	Zhu, Yitan; Li, Huai; Miller, David J; Wang, Zuyi; Xuan, Jianhua; Clarke, Robert; Hoffman, Eric P; Wang, Yue
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2566986/ https://www.ncbi.nlm.nih.gov/pubmed/18801195 http://dx.doi.org/10.1186/1471-2105-9-383

_version_	1782159976756674560
author	Zhu, Yitan Li, Huai Miller, David J Wang, Zuyi Xuan, Jianhua Clarke, Robert Hoffman, Eric P Wang, Yue
author_sort	Zhu, Yitan
collection	PubMed
description	BACKGROUND: The main limitations of most existing clustering methods used in genomic data analysis include heuristic or random algorithm initialization, the potential of finding poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and non-adaptive designs, in addition to the curse of dimensionality and the discernment of uninformative, uninteresting cluster structure associated with confounding variables. RESULTS: In an effort to partially address these limitations, we develop the VIsual Statistical Data Analyzer (VISDA) for cluster modeling, visualization, and discovery in genomic data. VISDA performs progressive, coarse-to-fine (divisive) hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data visualization, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The hierarchical visualization and clustering scheme of VISDA uses multiple local visualization subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures in a "divide and conquer" scenario. Multiple projection methods, each sensitive to a distinct type of clustering tendency, are used for data visualization, which increases the likelihood that cluster structures of interest are revealed. Initialization of the full dimensional model is based on first learning models with user/prior knowledge guidance on data projected into the low-dimensional visualization spaces. Model order selection for the high dimensional data is accomplished by Bayesian theoretic criteria and user justification applied via the hierarchy of low-dimensional visualization subspaces. 
Based on its complementary building blocks and flexible functionality, VISDA is generally applicable for gene clustering, sample clustering, and phenotype clustering (wherein phenotype labels for samples are known), albeit with minor algorithm modifications customized to each of these tasks. CONCLUSION: VISDA achieved robust and superior clustering accuracy, compared with several benchmark clustering schemes. The model order selection scheme in VISDA was shown to be effective for high dimensional genomic data clustering. On muscular dystrophy data and muscle regeneration data, VISDA identified biologically relevant co-expressed gene clusters. VISDA also captured the pathological relationships among different phenotypes revealed at the molecular level, through phenotype clustering on muscular dystrophy data and multi-category cancer data.
format	Text
id	pubmed-2566986
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2566986 2008-10-14 caBIG™ VISDA: Modeling, visualization, and discovery for cluster analysis of genomic data Zhu, Yitan Li, Huai Miller, David J Wang, Zuyi Xuan, Jianhua Clarke, Robert Hoffman, Eric P Wang, Yue BMC Bioinformatics Methodology Article BioMed Central 2008-09-18 /pmc/articles/PMC2566986/ /pubmed/18801195 http://dx.doi.org/10.1186/1471-2105-9-383 Text en Copyright © 2008 Zhu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	caBIG™ VISDA: Modeling, visualization, and discovery for cluster analysis of genomic data
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2566986/ https://www.ncbi.nlm.nih.gov/pubmed/18801195 http://dx.doi.org/10.1186/1471-2105-9-383
