Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the on...
Main authors: | Varshavsky, Roy; Horn, David; Linial, Michal |
---|---|
Format: | Text |
Language: | English |
Published: | Public Library of Science 2008 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375056/ https://www.ncbi.nlm.nih.gov/pubmed/18493326 http://dx.doi.org/10.1371/journal.pone.0002247 |
_version_ | 1782154567121633280 |
---|---|
author | Varshavsky, Roy Horn, David Linial, Michal |
author_facet | Varshavsky, Roy Horn, David Linial, Michal |
author_sort | Varshavsky, Roy |
collection | PubMed |
description | BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied. METHODOLOGY/PRINCIPAL FINDINGS: We show that hierarchical clustering methods that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion-channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. CONCLUSIONS: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can improve the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations. |
format | Text |
id | pubmed-2375056 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2008 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-2375056 2008-05-21 Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data Varshavsky, Roy Horn, David Linial, Michal PLoS One Research Article BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied. METHODOLOGY/PRINCIPAL FINDINGS: We show that hierarchical clustering methods that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion-channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. CONCLUSIONS: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can improve the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations. Public Library of Science 2008-05-21 /pmc/articles/PMC2375056/ /pubmed/18493326 http://dx.doi.org/10.1371/journal.pone.0002247 Text en Varshavsky et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Varshavsky, Roy Horn, David Linial, Michal Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data |
title | Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data |
title_full | Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data |
title_fullStr | Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data |
title_full_unstemmed | Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data |
title_short | Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data |
title_sort | global considerations in hierarchical clustering reveal meaningful patterns in data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375056/ https://www.ncbi.nlm.nih.gov/pubmed/18493326 http://dx.doi.org/10.1371/journal.pone.0002247 |
work_keys_str_mv | AT varshavskyroy globalconsiderationsinhierarchicalclusteringrevealmeaningfulpatternsindata AT horndavid globalconsiderationsinhierarchicalclusteringrevealmeaningfulpatternsindata AT linialmichal globalconsiderationsinhierarchicalclusteringrevealmeaningfulpatternsindata |
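
The abstract in this record contrasts bottom-up (BU, agglomerative) with top-down (TD, divisive) hierarchical clustering and scores both against expert labels. The following is a minimal sketch of that contrast, not the article's method: the average-linkage merge rule, the farthest-pair split rule, the purity score, the function names, and the toy data are all assumptions chosen for illustration. In particular, the split rule here is a simple stand-in and is not the density-based TD algorithm or the ClustTree toolbox that the abstract mentions.

```python
import math
from collections import Counter
from itertools import combinations


def bottom_up(points, k):
    """BU sketch: start from singletons, merge the closest pair of clusters
    (average linkage, an assumed choice) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        def avg_link(pair):
            a, b = pair
            total = sum(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
            return total / (len(clusters[a]) * len(clusters[b]))

        a, b = min(combinations(range(len(clusters)), 2), key=avg_link)
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


def top_down(points, k):
    """TD sketch: start from one cluster, repeatedly split the cluster with
    the largest diameter around its two farthest members (an assumed rule).
    Assumes k <= len(points) and that the points are distinct."""
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        def diameter(c):
            return max((math.dist(points[i], points[j])
                        for i, j in combinations(c, 2)), default=0.0)

        target = max(clusters, key=diameter)
        s1, s2 = max(combinations(target, 2),
                     key=lambda p: math.dist(points[p[0]], points[p[1]]))
        left = [i for i in target
                if math.dist(points[i], points[s1]) <= math.dist(points[i], points[s2])]
        right = [i for i in target if i not in left]
        clusters.remove(target)
        clusters += [left, right]
    return clusters


def purity(clusters, labels):
    """Fraction of points that carry the majority label of their cluster."""
    hits = sum(max(Counter(labels[i] for i in c).values()) for c in clusters)
    return hits / len(labels)


# Hypothetical toy data: two well-separated groups with "expert" labels.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["a", "a", "a", "b", "b", "b"]

bu_score = purity(bottom_up(points, 2), labels)
td_score = purity(top_down(points, 2), labels)
print("BU purity:", bu_score)
print("TD purity:", td_score)
```

On data this cleanly separated both strategies recover the two labeled groups; the differences the abstract describes would only show up on data where local merges and global splits disagree.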