
A data-driven approach to estimating the number of clusters in hierarchical clustering

DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few autom...


Bibliographic Details
Main author:	Zambelli, Antoine E.
Format:	Online Article Text
Language:	English
Published:	F1000Research 2016
Subjects:	Method Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373427/ https://www.ncbi.nlm.nih.gov/pubmed/28408972 http://dx.doi.org/10.12688/f1000research.10103.1

_version_	1782518774041149440
author	Zambelli, Antoine E.
author_facet	Zambelli, Antoine E.
author_sort	Zambelli, Antoine E.
collection	PubMed
description	DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few automated algorithms for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework to create a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross-validation techniques. We find that the overall performance of the maximum difference method is comparable to or greater than that of the gap statistic in multi-cluster scenarios, and achieves that performance at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study related to their mixing and cross-validation potential. We particularly recommend the use of the maximum difference method in multi-cluster scenarios given its accuracy and execution times, and present it as an alternative to existing algorithms.
format	Online Article Text
id	pubmed-5373427
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	F1000Research
record_format	MEDLINE/PubMed
spelling	pubmed-5373427 2017-04-12 A data-driven approach to estimating the number of clusters in hierarchical clustering Zambelli, Antoine E. F1000Res Method Article DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few automated algorithms for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework to create a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross-validation techniques. We find that the overall performance of the maximum difference method is comparable to or greater than that of the gap statistic in multi-cluster scenarios, and achieves that performance at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study related to their mixing and cross-validation potential. We particularly recommend the use of the maximum difference method in multi-cluster scenarios given its accuracy and execution times, and present it as an alternative to existing algorithms.
F1000Research 2016-12-01 /pmc/articles/PMC5373427/ /pubmed/28408972 http://dx.doi.org/10.12688/f1000research.10103.1 Text en Copyright: © 2016 Zambelli AE http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Method Article Zambelli, Antoine E. A data-driven approach to estimating the number of clusters in hierarchical clustering
title	A data-driven approach to estimating the number of clusters in hierarchical clustering
title_full	A data-driven approach to estimating the number of clusters in hierarchical clustering
title_fullStr	A data-driven approach to estimating the number of clusters in hierarchical clustering
title_full_unstemmed	A data-driven approach to estimating the number of clusters in hierarchical clustering
title_short	A data-driven approach to estimating the number of clusters in hierarchical clustering
title_sort	data-driven approach to estimating the number of clusters in hierarchical clustering
topic	Method Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373427/ https://www.ncbi.nlm.nih.gov/pubmed/28408972 http://dx.doi.org/10.12688/f1000research.10103.1
work_keys_str_mv	AT zambelliantoinee adatadrivenapproachtoestimatingthenumberofclustersinhierarchicalclustering AT zambelliantoinee datadrivenapproachtoestimatingthenumberofclustersinhierarchicalclustering
