Cargando…

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Appl...

Descripción completa

Detalles Bibliográficos
Autores principales:	Loewenstein, Yaniv, Portugaly, Elon, Fromer, Menachem, Linial, Michal
Formato:	Texto
Lenguaje:	English
Publicado:	Oxford University Press 2008
Materias:	Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2718652/ https://www.ncbi.nlm.nih.gov/pubmed/18586742 http://dx.doi.org/10.1093/bioinformatics/btn174

_version_	1782170008620630016
author	Loewenstein, Yaniv Portugaly, Elon Fromer, Menachem Linial, Michal
author_facet	Loewenstein, Yaniv Portugaly, Elon Fromer, Menachem Linial, Michal
author_sort	Loewenstein, Yaniv
collection	PubMed
description	Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il
format	Text
id	pubmed-2718652
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-27186522009-07-31 Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space Loewenstein, Yaniv Portugaly, Elon Fromer, Menachem Linial, Michal Bioinformatics Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il Oxford University Press 2008-07-01 /pmc/articles/PMC2718652/ /pubmed/18586742 http://dx.doi.org/10.1093/bioinformatics/btn174 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto Loewenstein, Yaniv Portugaly, Elon Fromer, Menachem Linial, Michal Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
title	Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
title_full	Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
title_fullStr	Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
title_full_unstemmed	Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
title_short	Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
title_sort	efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
topic	Ismb 2008 Conference Proceedings 19–23 July 2008, Toronto
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2718652/ https://www.ncbi.nlm.nih.gov/pubmed/18586742 http://dx.doi.org/10.1093/bioinformatics/btn174
work_keys_str_mv	AT loewensteinyaniv efficientalgorithmsforaccuratehierarchicalclusteringofhugedatasetstacklingtheentireproteinspace AT portugalyelon efficientalgorithmsforaccuratehierarchicalclusteringofhugedatasetstacklingtheentireproteinspace AT fromermenachem efficientalgorithmsforaccuratehierarchicalclusteringofhugedatasetstacklingtheentireproteinspace AT linialmichal efficientalgorithmsforaccuratehierarchicalclusteringofhugedatasetstacklingtheentireproteinspace

Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

Ejemplares similares