Cargando…
A data-driven approach to estimating the number of clusters in hierarchical clustering
DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few autom...
Autor principal: | |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
F1000Research
2016
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373427/ https://www.ncbi.nlm.nih.gov/pubmed/28408972 http://dx.doi.org/10.12688/f1000research.10103.1 |
_version_ | 1782518774041149440 |
---|---|
author | Zambelli, Antoine E. |
author_facet | Zambelli, Antoine E. |
author_sort | Zambelli, Antoine E. |
collection | PubMed |
description | DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few automated algorithms for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework to create a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross validation techniques. We find that the overall performance of the maximum difference method is comparable or greater to that of the gap statistic in multi-cluster scenarios, and achieves that performance at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study related to their mixing and cross-validation potential. We particularly recommend the use of the maximum difference method in multi-cluster scenarios given its accuracy and execution times, and present it as an alternative to existing algorithms. |
format | Online Article Text |
id | pubmed-5373427 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | F1000Research |
record_format | MEDLINE/PubMed |
spelling | pubmed-53734272017-04-12 A data-driven approach to estimating the number of clusters in hierarchical clustering Zambelli, Antoine E. F1000Res Method Article DNA microarray and gene expression problems often require a researcher to perform clustering on their data in a bid to better understand its structure. In cases where the number of clusters is not known, one can resort to hierarchical clustering methods. However, there currently exist very few automated algorithms for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework to create a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross validation techniques. We find that the overall performance of the maximum difference method is comparable or greater to that of the gap statistic in multi-cluster scenarios, and achieves that performance at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study related to their mixing and cross-validation potential. We particularly recommend the use of the maximum difference method in multi-cluster scenarios given its accuracy and execution times, and present it as an alternative to existing algorithms. F1000Research 2016-12-01 /pmc/articles/PMC5373427/ /pubmed/28408972 http://dx.doi.org/10.12688/f1000research.10103.1 Text en Copyright: © 2016 Zambelli AE http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Method Article Zambelli, Antoine E. A data-driven approach to estimating the number of clusters in hierarchical clustering |
title | A data-driven approach to estimating the number of clusters in hierarchical clustering |
title_full | A data-driven approach to estimating the number of clusters in hierarchical clustering |
title_fullStr | A data-driven approach to estimating the number of clusters in hierarchical clustering |
title_full_unstemmed | A data-driven approach to estimating the number of clusters in hierarchical clustering |
title_short | A data-driven approach to estimating the number of clusters in hierarchical clustering |
title_sort | data-driven approach to estimating the number of clusters in hierarchical clustering |
topic | Method Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5373427/ https://www.ncbi.nlm.nih.gov/pubmed/28408972 http://dx.doi.org/10.12688/f1000research.10103.1 |
work_keys_str_mv | AT zambelliantoinee adatadrivenapproachtoestimatingthenumberofclustersinhierarchicalclustering AT zambelliantoinee datadrivenapproachtoestimatingthenumberofclustersinhierarchicalclustering |