Cargando…

Fast approximate hierarchical clustering using similarity heuristics

BACKGROUND: Agglomerative hierarchical clustering (AHC) is a common unsupervised data analysis technique used in several biological applications. Standard AHC methods require that all pairwise distances between data objects must be known. With ever-increasing data sizes this quadratic complexity pos...

Descripción completa

Detalles Bibliográficos
Autores principales:	Kull, Meelis, Vilo, Jaak
Formato:	Texto
Lenguaje:	English
Publicado:	BioMed Central 2008
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2561018/ https://www.ncbi.nlm.nih.gov/pubmed/18822115 http://dx.doi.org/10.1186/1756-0381-1-9

_version_	1782159700242989056
author	Kull, Meelis Vilo, Jaak
author_facet	Kull, Meelis Vilo, Jaak
author_sort	Kull, Meelis
collection	PubMed
description	BACKGROUND: Agglomerative hierarchical clustering (AHC) is a common unsupervised data analysis technique used in several biological applications. Standard AHC methods require that all pairwise distances between data objects must be known. With ever-increasing data sizes this quadratic complexity poses problems that cannot be overcome by simply waiting for faster computers. RESULTS: We propose an approximate AHC algorithm HappieClust which can output a biologically meaningful clustering of a large dataset more than an order of magnitude faster than full AHC algorithms. The key to the algorithm is to limit the number of calculated pairwise distances to a carefully chosen subset of all possible distances. We choose distances using a similarity heuristic based on a small set of pivot objects. The heuristic efficiently finds pairs of similar objects and these help to mimic the greedy choices of full AHC. Quality of approximate AHC as compared to full AHC is studied with three measures. The first measure evaluates the global quality of the achieved clustering, while the second compares biological relevance using enrichment of biological functions in every subtree of the clusterings. The third measure studies how well the contents of subtrees are conserved between the clusterings. CONCLUSION: The HappieClust algorithm is well suited for large-scale gene expression visualization and analysis both on personal computers as well as public online web applications. The software is available from the URL
format	Text
id	pubmed-2561018
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-25610182008-10-07 Fast approximate hierarchical clustering using similarity heuristics Kull, Meelis Vilo, Jaak BioData Min Research BACKGROUND: Agglomerative hierarchical clustering (AHC) is a common unsupervised data analysis technique used in several biological applications. Standard AHC methods require that all pairwise distances between data objects must be known. With ever-increasing data sizes this quadratic complexity poses problems that cannot be overcome by simply waiting for faster computers. RESULTS: We propose an approximate AHC algorithm HappieClust which can output a biologically meaningful clustering of a large dataset more than an order of magnitude faster than full AHC algorithms. The key to the algorithm is to limit the number of calculated pairwise distances to a carefully chosen subset of all possible distances. We choose distances using a similarity heuristic based on a small set of pivot objects. The heuristic efficiently finds pairs of similar objects and these help to mimic the greedy choices of full AHC. Quality of approximate AHC as compared to full AHC is studied with three measures. The first measure evaluates the global quality of the achieved clustering, while the second compares biological relevance using enrichment of biological functions in every subtree of the clusterings. The third measure studies how well the contents of subtrees are conserved between the clusterings. CONCLUSION: The HappieClust algorithm is well suited for large-scale gene expression visualization and analysis both on personal computers as well as public online web applications. The software is available from the URL BioMed Central 2008-09-22 /pmc/articles/PMC2561018/ /pubmed/18822115 http://dx.doi.org/10.1186/1756-0381-1-9 Text en Copyright © 2008 Kull and Vilo; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Kull, Meelis Vilo, Jaak Fast approximate hierarchical clustering using similarity heuristics
title	Fast approximate hierarchical clustering using similarity heuristics
title_full	Fast approximate hierarchical clustering using similarity heuristics
title_fullStr	Fast approximate hierarchical clustering using similarity heuristics
title_full_unstemmed	Fast approximate hierarchical clustering using similarity heuristics
title_short	Fast approximate hierarchical clustering using similarity heuristics
title_sort	fast approximate hierarchical clustering using similarity heuristics
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2561018/ https://www.ncbi.nlm.nih.gov/pubmed/18822115 http://dx.doi.org/10.1186/1756-0381-1-9
work_keys_str_mv	AT kullmeelis fastapproximatehierarchicalclusteringusingsimilarityheuristics AT vilojaak fastapproximatehierarchicalclusteringusingsimilarityheuristics

Fast approximate hierarchical clustering using similarity heuristics

Ejemplares similares