
Model order selection for bio-molecular data clustering

BACKGROUND: Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough "a priori...


Bibliographic Details
Main authors:	Bertoni, Alberto; Valentini, Giorgio
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1892076/ https://www.ncbi.nlm.nih.gov/pubmed/17493256 http://dx.doi.org/10.1186/1471-2105-8-S2-S7

author	Bertoni, Alberto; Valentini, Giorgio
collection	PubMed
description	BACKGROUND: Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough "a priori" biological knowledge to evaluate both the number of clusters and their validity. Recently several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters, but despite their successful application to the analysis of complex bio-molecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional bio-molecular data are still major problems. RESULTS: We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and, if possible, multi-level structures simultaneously present in the data (e.g. hierarchical structures). CONCLUSION: The experimental results show that our model order selection methods are competitive with other state-of-the-art stability-based algorithms and are able to detect multiple levels of structure underlying both synthetic and gene expression data.
format	Text
id	pubmed-1892076
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1892076 2007-06-15 Model order selection for bio-molecular data clustering Bertoni, Alberto Valentini, Giorgio BMC Bioinformatics Research BioMed Central 2007-05-03 /pmc/articles/PMC1892076/ /pubmed/17493256 http://dx.doi.org/10.1186/1471-2105-8-S2-S7 Text en Copyright © 2007 Bertoni and Valentini; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Model order selection for bio-molecular data clustering
topic	Research
