A data-driven approach to estimating the number of clusters in hierarchical clustering

Bibliographic Details
Main Author: Zambelli, Antoine E.
Format: Online Article Text
Language: English
Published: F1000Research 2016
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373427/
https://www.ncbi.nlm.nih.gov/pubmed/28408972
http://dx.doi.org/10.12688/f1000research.10103.1
_version_ 1782518774041149440
author Zambelli, Antoine E.
author_facet Zambelli, Antoine E.
author_sort Zambelli, Antoine E.
collection PubMed
description DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few automated algorithms for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework to create a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross-validation techniques. We find that the overall performance of the maximum difference method is comparable to or greater than that of the gap statistic in multi-cluster scenarios, and that it achieves this performance at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study related to their mixing and cross-validation potential. We particularly recommend the use of the maximum difference method in multi-cluster scenarios given its accuracy and execution times, and present it as an alternative to existing algorithms.
format Online
Article
Text
id pubmed-5373427
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher F1000Research
record_format MEDLINE/PubMed
spelling pubmed-5373427 2017-04-12 A data-driven approach to estimating the number of clusters in hierarchical clustering Zambelli, Antoine E. F1000Res Method Article
F1000Research 2016-12-01 /pmc/articles/PMC5373427/ /pubmed/28408972 http://dx.doi.org/10.12688/f1000research.10103.1 Text en Copyright: © 2016 Zambelli AE http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Method Article
Zambelli, Antoine E.
A data-driven approach to estimating the number of clusters in hierarchical clustering
title A data-driven approach to estimating the number of clusters in hierarchical clustering
title_sort data-driven approach to estimating the number of clusters in hierarchical clustering
topic Method Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373427/
https://www.ncbi.nlm.nih.gov/pubmed/28408972
http://dx.doi.org/10.12688/f1000research.10103.1
work_keys_str_mv AT zambelliantoinee adatadrivenapproachtoestimatingthenumberofclustersinhierarchicalclustering
AT zambelliantoinee datadrivenapproachtoestimatingthenumberofclustersinhierarchicalclustering