
Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data

BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the on...


Bibliographic Details
Main Authors: Varshavsky, Roy; Horn, David; Linial, Michal
Format: Text
Language: English
Published: Public Library of Science 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375056/
https://www.ncbi.nlm.nih.gov/pubmed/18493326
http://dx.doi.org/10.1371/journal.pone.0002247
_version_ 1782154567121633280
author Varshavsky, Roy
Horn, David
Linial, Michal
author_facet Varshavsky, Roy
Horn, David
Linial, Michal
author_sort Varshavsky, Roy
collection PubMed
description BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied. METHODOLOGY/PRINCIPAL FINDINGS: We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. CONCLUSIONS: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can leverage the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations.
format Text
id pubmed-2375056
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2375056 2008-05-21 Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data Varshavsky, Roy Horn, David Linial, Michal PLoS One Research Article BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied. METHODOLOGY/PRINCIPAL FINDINGS: We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. CONCLUSIONS: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can leverage the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations. Public Library of Science 2008-05-21 /pmc/articles/PMC2375056/ /pubmed/18493326 http://dx.doi.org/10.1371/journal.pone.0002247 Text en Varshavsky et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Varshavsky, Roy
Horn, David
Linial, Michal
Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
title Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
title_full Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
title_fullStr Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
title_full_unstemmed Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
title_short Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
title_sort global considerations in hierarchical clustering reveal meaningful patterns in data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375056/
https://www.ncbi.nlm.nih.gov/pubmed/18493326
http://dx.doi.org/10.1371/journal.pone.0002247
work_keys_str_mv AT varshavskyroy globalconsiderationsinhierarchicalclusteringrevealmeaningfulpatternsindata
AT horndavid globalconsiderationsinhierarchicalclusteringrevealmeaningfulpatternsindata
AT linialmichal globalconsiderationsinhierarchicalclusteringrevealmeaningfulpatternsindata