How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity

BACKGROUND: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are...

Bibliographic Details
Main authors:	Leal, Wilmer; Llanos, Eugenio J.; Restrepo, Guillermo; Suárez, Carlos F.; Patarroyo, Manuel Elkin
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2016
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727313/ https://www.ncbi.nlm.nih.gov/pubmed/26816532 http://dx.doi.org/10.1186/s13321-016-0114-x

_version_	1782411944082276352
author	Leal, Wilmer; Llanos, Eugenio J.; Restrepo, Guillermo; Suárez, Carlos F.; Patarroyo, Manuel Elkin
author_facet	Leal, Wilmer; Llanos, Eugenio J.; Restrepo, Guillermo; Suárez, Carlos F.; Patarroyo, Manuel Elkin
author_sort	Leal, Wilmer
collection	PubMed
description	BACKGROUND: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when such parameters are fixed, ties in proximity (i.e. two clusters equidistant from a third one) may produce several different dendrograms, having different possible clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, leading to questions concerning the permanence of clusters in all the resulting dendrograms; this happens, for example, when using HCA for grouping molecular descriptors to select the less similar ones in QSAR studies. RESULTS: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well separated clusters was used, as well as a set of 1666 molecular descriptors calculated for a group of molecules having hepatotoxic activity, to show how our functions may be used for studying the effect of ties in HCA. Such functions were not restricted to the tie case; the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves is discussed, e.g. dendrograms from variations of HCA parameters. It was found that ties occurred frequently, some yielding tens of thousands of dendrograms, even for small data sets. CONCLUSIONS: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This would imply that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased, thereby requiring an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need for evaluating the effect of ties on clustering patterns before classification results can be used accurately. [Figure: see text]
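
The idea summarized in the abstract (ties in proximity leading to several dendrograms, and clusters being scored by how often they appear across them) can be illustrated with a small sketch. This is not the authors' method: the abstract mentions four graph-theoretical frequency measures without defining them, so the code below assumes one simple stand-in, the fraction of distinct dendrograms that contain a given cluster. The function names `all_dendrograms` and `cluster_frequencies`, the choice of complete linkage, and the four-point toy data are all assumptions made for illustration. The enumeration branches on every tie, so its cost can grow exponentially, in line with the abstract's remark that some ties yield tens of thousands of dendrograms.

```python
from itertools import combinations


def all_dendrograms(points, dist):
    """Enumerate every dendrogram complete-linkage HCA can yield under ties.

    A dendrogram is represented as a frozenset of its non-singleton
    clusters; each cluster is a frozenset of leaf labels.
    """
    def linkage(a, b):
        # Complete linkage (an assumption): largest pairwise distance.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    results = set()

    def grow(active, merged):
        if len(active) == 1:
            results.add(merged)  # the set removes duplicate dendrograms
            return
        pairs = list(combinations(active, 2))
        scores = [linkage(a, b) for a, b in pairs]
        best = min(scores)
        for (a, b), score in zip(pairs, scores):
            # Exact equality is used to detect ties, so integer (or
            # otherwise exactly comparable) distances are assumed.
            if score == best:
                new = a | b
                grow((active - {a, b}) | {new}, merged | {new})

    leaves = frozenset(frozenset([p]) for p in points)
    grow(leaves, frozenset())
    return results


def cluster_frequencies(dendrograms):
    """Fraction of dendrograms in which each cluster occurs."""
    counts = {}
    for dendrogram in dendrograms:
        for cluster in dendrogram:
            counts[cluster] = counts.get(cluster, 0) + 1
    total = len(dendrograms)
    return {cluster: n / total for cluster, n in counts.items()}


# Hypothetical toy data: four points on a line; a-b and b-c are tied.
positions = {"a": 0, "b": 1, "c": 2, "d": 10}
dist = {
    frozenset((x, y)): abs(positions[x] - positions[y])
    for x, y in combinations(positions, 2)
}

dendrograms = all_dendrograms(positions, dist)
freq = cluster_frequencies(dendrograms)
for cluster, value in sorted(freq.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(cluster), value)
```

In this toy case the tie between the pairs a-b and b-c gives two dendrograms, so a cluster such as {a, b} appears in only half of them while {a, b, c} appears in both. A single HCA run would report one of the two and hide the other.
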
format	Online Article Text
id	pubmed-4727313
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-4727313 2016-01-27 How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity Leal, Wilmer; Llanos, Eugenio J.; Restrepo, Guillermo; Suárez, Carlos F.; Patarroyo, Manuel Elkin J Cheminform Research Article Springer International Publishing 2016-01-25 /pmc/articles/PMC4727313/ /pubmed/26816532 http://dx.doi.org/10.1186/s13321-016-0114-x Text en © Leal et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727313/ https://www.ncbi.nlm.nih.gov/pubmed/26816532 http://dx.doi.org/10.1186/s13321-016-0114-x
