
Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters

Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a...

Bibliographic Details
Main authors:	Tellaroli, Paola; Bazzi, Marco; Donato, Michele; Brazzale, Alessandra R.; Drăghici, Sorin
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2016
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4807765/ https://www.ncbi.nlm.nih.gov/pubmed/27015427 http://dx.doi.org/10.1371/journal.pone.0152333

author	Tellaroli, Paola; Bazzi, Marco; Donato, Michele; Brazzale, Alessandra R.; Drăghici, Sorin
collection	PubMed
description	Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a specific data set is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well-established hierarchical clustering algorithms: Ward’s minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward’s and Complete-linkage. We show, on both simulated and real datasets, that CC performs better than the other methods in terms of the identification of the correct number of clusters, the identification of outliers, and the determination of real cluster memberships. We used CC on samples, to identify disease subtypes, and on gene profiles, to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be used successfully in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.
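The abstract describes CC only at a high level, so the following is a loose illustration of what "combining" Ward's and Complete-linkage partitions could look like, not the published procedure and not the API of the CRAN package. It is written in Python with SciPy rather than R. The function name cross_partition and the parameters k_ward and k_complete are hypothetical, and the rule used here (keep a point only if it falls in the largest overlap between a Ward cluster and a Complete-linkage cluster, otherwise leave it unassigned) is an assumption made for the sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cross_partition(data, k_ward, k_complete):
    """Cut a Ward tree and a complete-linkage tree, then keep only the
    points on which the two partitions agree (illustrative rule only).

    Returns (labels, table); labels uses -1 for unassigned points.
    """
    ward_labels = fcluster(linkage(data, method="ward"),
                           t=k_ward, criterion="maxclust")
    comp_labels = fcluster(linkage(data, method="complete"),
                           t=k_complete, criterion="maxclust")

    # Contingency table: rows are Ward clusters, columns are
    # complete-linkage clusters, cells count shared points.
    table = np.zeros((ward_labels.max(), comp_labels.max()), dtype=int)
    for w, c in zip(ward_labels, comp_labels):
        table[w - 1, c - 1] += 1

    # For each Ward cluster, keep the points in its largest overlap cell;
    # everything else stays unassigned, giving a partial clustering.
    labels = np.full(len(data), -1)
    for w in range(table.shape[0]):
        best_c = table[w].argmax()
        agree = (ward_labels == w + 1) & (comp_labels == best_c + 1)
        labels[agree] = w
    return labels, table


# Toy data (made up for this sketch): two tight groups and one isolated point.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
isolated = np.array([[2.5, 20.0]])
data = np.vstack([group_a, group_b, isolated])

labels, table = cross_partition(data, k_ward=2, k_complete=3)
print(labels)
print(table)
```

Cutting the Complete-linkage tree at more clusters than the Ward tree is what lets an isolated point end up in its own small cell, so that it can be left out of the final partition instead of being forced into a group.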
id	pubmed-4807765
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-4807765 2016-04-05 PLoS One Research Article Public Library of Science 2016-03-25 /pmc/articles/PMC4807765/ /pubmed/27015427 http://dx.doi.org/10.1371/journal.pone.0152333 Text en © 2016 Tellaroli et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Cross-Clustering: A Partial Clustering Algorithm with Automatic Estimation of the Number of Clusters
