
Semi-supervised adaptive-height snipping of the hierarchical clustering tree

BACKGROUND: In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters, but compute a hierarchical representation of the data set. Usually, a fixed height on the HC tree is used, and each conti...

Bibliographic Details
Main Authors: Obulkasim, Askar; Meijer, Gerrit A; van de Wiel, Mark A
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302100/
https://www.ncbi.nlm.nih.gov/pubmed/25592847
http://dx.doi.org/10.1186/s12859-014-0448-1
_version_ 1782353737072771072
author Obulkasim, Askar
Meijer, Gerrit A
van de Wiel, Mark A
collection PubMed
description BACKGROUND: In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters, but compute a hierarchical representation of the data set. Usually, a fixed height on the HC tree is used, and each contiguous branch of samples below that height is considered a separate cluster. Because of the fixed-height cutting, those clusters may not reveal significant functional coherence hidden deeper in the tree. In addition, most existing approaches do not use available clinical information to guide cluster extraction from the HC tree. Thus, the identified subgroups may be difficult to interpret in relation to that information. RESULTS: We develop a novel framework for decomposing the HC tree into clusters by semi-supervised piecewise snipping. The framework, called guided piecewise snipping, uses both molecular data and clinical information to decompose the HC tree into clusters. It cuts the given HC tree at variable heights to find a partition (a set of non-overlapping clusters) that not only represents a structure deemed to underlie the data from which the HC tree is derived, but is also maximally consistent with the supplied clinical data. Moreover, the approach does not require the user to specify the number of clusters prior to the analysis. Extensive results on simulated and multiple medical data sets show that our approach consistently produces more meaningful clusters than the standard fixed-height cut and/or non-guided approaches. CONCLUSIONS: The guided piecewise snipping approach features several novelties and advantages over existing approaches. The proposed algorithm is generic, and can be combined with other algorithms that operate on detected clusters. This approach represents an advancement in several regards: (1) a piecewise tree snipping framework that efficiently extracts clusters by snipping the HC tree, possibly at variable heights, while preserving the HC tree structure; (2) a flexible implementation allowing a variety of data types for both building and snipping the HC tree, including patient follow-up data such as survival as auxiliary information. The data sets and R code are provided as supplementary files. The proposed method is available from Bioconductor as the R package HCsnip. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-014-0448-1) contains supplementary material, which is available to authorized users.
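The contrast the abstract draws between a fixed-height cut and a guided, variable-height cut can be illustrated with a small sketch. This is not the HCsnip algorithm or its API: the function names (`build_tree`, `fixed_cut`, `guided_snip`), the toy 1-D data, the label-purity score, and the `penalty` parameter are all assumptions made for illustration, and Python is used here only for convenience (the published implementation is in R).

```python
from collections import Counter

def build_tree(points):
    """Single-linkage agglomerative clustering on 1-D points (toy data only)."""
    # Each node is a tuple: (merge_height, leaf_indices, left_child, right_child)
    nodes = [(0.0, (i,), None, None) for i in range(len(points))]
    while len(nodes) > 1:
        best = None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                d = min(abs(points[i] - points[j])
                        for i in nodes[a][1] for j in nodes[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = (d, nodes[a][1] + nodes[b][1], nodes[a], nodes[b])
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]

def fixed_cut(node, height):
    """Standard cut: each branch merged at or below `height` is one cluster."""
    if node[2] is None or node[0] <= height:
        return [sorted(node[1])]
    return fixed_cut(node[2], height) + fixed_cut(node[3], height)

def guided_snip(node, labels, penalty=0.5):
    """Illustrative variable-height cut guided by auxiliary labels.

    Assumed score (not from the paper): the majority-label count of each
    cluster, minus `penalty` per cluster. Each branch is either kept whole
    or split, whichever scores higher. Returns (score, clusters).
    """
    counts = Counter(labels[i] for i in node[1])
    keep = (max(counts.values()) - penalty, [sorted(node[1])])
    if node[2] is None:
        return keep
    left_score, left_clusters = guided_snip(node[2], labels, penalty)
    right_score, right_clusters = guided_snip(node[3], labels, penalty)
    split = (left_score + right_score, left_clusters + right_clusters)
    return split if split[0] > keep[0] else keep

# Toy data: a loose group with label A, and a tight group mixing B and C.
points = [0.0, 1.0, 2.0, 3.0, 10.0, 10.2, 10.6, 10.8]
labels = ["A", "A", "A", "A", "B", "B", "C", "C"]
tree = build_tree(points)

print(sorted(fixed_cut(tree, 2.0)))   # high fixed cut
print(sorted(fixed_cut(tree, 0.3)))   # low fixed cut
score, clusters = guided_snip(tree, labels)
print(sorted(clusters))               # guided, variable-height cut
```

On this toy data, a fixed cut at height 2.0 leaves the B and C samples in one cluster, while a fixed cut at 0.3 separates them only at the cost of breaking the A group into singletons. The guided cut keeps the A group whole and splits B from C, because it decides branch by branch rather than at a single height.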
format Online
Article
Text
id pubmed-4302100
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4302100 2015-01-22 Semi-supervised adaptive-height snipping of the hierarchical clustering tree Obulkasim, Askar Meijer, Gerrit A van de Wiel, Mark A BMC Bioinformatics Research Article BioMed Central 2015-01-16 /pmc/articles/PMC4302100/ /pubmed/25592847 http://dx.doi.org/10.1186/s12859-014-0448-1 Text en © Obulkasim et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Semi-supervised adaptive-height snipping of the hierarchical clustering tree
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302100/
https://www.ncbi.nlm.nih.gov/pubmed/25592847
http://dx.doi.org/10.1186/s12859-014-0448-1