Entropy-driven partitioning of the hierarchical protein space

Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all inclu...

Bibliographic Details
Main authors: Rappoport, Nadav, Stern, Amos, Linial, Nathan, Linial, Michal
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4147929/
https://www.ncbi.nlm.nih.gov/pubmed/25161256
http://dx.doi.org/10.1093/bioinformatics/btu478
_version_ 1782332540332277760
author Rappoport, Nadav
Stern, Amos
Linial, Nathan
Linial, Michal
collection PubMed
description Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all included proteins, finding an optimal level of granularity for the purpose of focusing on protein functional groups remains elusive. Here, we ask whether knowledge-based annotations on protein families can support the automatic unsupervised methods for identifying high-quality protein families. We present a method that yields within the ProtoNet hierarchy an optimal partition of clusters, relative to manual annotation schemes. The method’s principle is to minimize the entropy-derived distance between annotation-based partitions and all available hierarchical partitions. We describe the best front (BF) partition of 2 478 328 proteins from UniRef50. Of 4 929 553 ProtoNet tree clusters, the BF based on Pfam annotations contains 26 891 clusters. The high quality of the partition is validated by the close correspondence with the set of clusters that best describe thousands of keywords of Pfam. The BF is shown to be superior to a naïve cut in the ProtoNet tree that yields a similar number of clusters. Finally, we used parameters intrinsic to the clustering process to enrich a priori the BF’s clusters. We present the entropy-based method’s benefit in overcoming the unavoidable limitations of nested clusters in ProtoNet. We suggest that this automatic information-based cluster selection can be useful for other large-scale annotation schemes, as well as for systematically testing and comparing putative families derived from alternative clustering methods. Availability and implementation: A catalog of BF clusters for thousands of Pfam keywords is provided at http://protonet.cs.huji.ac.il/bestFront/ Contact: michall@cc.huji.ac.il
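The description above states that the method minimizes an entropy-derived distance between an annotation-based partition and the partitions available in the hierarchy, but this record does not give the exact formula. As a minimal sketch only, the following Python uses the variation of information as a stand-in entropy-based distance and picks, from a few hand-written candidate partitions, the one closest to a toy annotation. The function names, the toy labels, and the choice of variation of information are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def variation_of_information(a, b):
    """Assumed stand-in distance: VI(a, b) = 2*H(a, b) - H(a) - H(b).

    `a` and `b` assign a cluster label to the same items, in the same order.
    """
    if len(a) != len(b):
        raise ValueError("partitions must label the same items")
    joint = entropy(list(zip(a, b)))
    return 2 * joint - entropy(a) - entropy(b)


def closest_partition(reference, candidates):
    """Return the name of the candidate partition with the smallest VI."""
    return min(
        candidates,
        key=lambda name: variation_of_information(reference, candidates[name]),
    )


# Hypothetical toy data: six proteins with a manual family annotation.
annotation = ["kinase", "kinase", "kinase", "globin", "globin", "globin"]

# Hypothetical cuts of a clustering tree at three levels of granularity.
candidate_cuts = {
    "coarse": [0, 0, 0, 0, 0, 0],
    "medium": [0, 0, 0, 1, 1, 1],
    "fine": [0, 1, 2, 3, 4, 5],
}

best = closest_partition(annotation, candidate_cuts)
print(best)
```

In this toy case the cut that reproduces the annotation exactly has distance zero, while both the over-merged and the over-split cuts are penalized. That is the behaviour one would want from any entropy-based criterion for choosing a level of granularity.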
format Online
Article
Text
id pubmed-4147929
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4147929 2014-09-02 Entropy-driven partitioning of the hierarchical protein space Rappoport, Nadav Stern, Amos Linial, Nathan Linial, Michal Bioinformatics Eccb 2014 Proceedings Papers Committee Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all included proteins, finding an optimal level of granularity for the purpose of focusing on protein functional groups remains elusive. Here, we ask whether knowledge-based annotations on protein families can support the automatic unsupervised methods for identifying high-quality protein families. We present a method that yields within the ProtoNet hierarchy an optimal partition of clusters, relative to manual annotation schemes. The method’s principle is to minimize the entropy-derived distance between annotation-based partitions and all available hierarchical partitions. We describe the best front (BF) partition of 2 478 328 proteins from UniRef50. Of 4 929 553 ProtoNet tree clusters, the BF based on Pfam annotations contains 26 891 clusters. The high quality of the partition is validated by the close correspondence with the set of clusters that best describe thousands of keywords of Pfam. The BF is shown to be superior to a naïve cut in the ProtoNet tree that yields a similar number of clusters. Finally, we used parameters intrinsic to the clustering process to enrich a priori the BF’s clusters. We present the entropy-based method’s benefit in overcoming the unavoidable limitations of nested clusters in ProtoNet. We suggest that this automatic information-based cluster selection can be useful for other large-scale annotation schemes, as well as for systematically testing and comparing putative families derived from alternative clustering methods.
Availability and implementation: A catalog of BF clusters for thousands of Pfam keywords is provided at http://protonet.cs.huji.ac.il/bestFront/ Contact: michall@cc.huji.ac.il Oxford University Press 2014-09-01 2014-08-22 /pmc/articles/PMC4147929/ /pubmed/25161256 http://dx.doi.org/10.1093/bioinformatics/btu478 Text en © The Author 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Entropy-driven partitioning of the hierarchical protein space
topic Eccb 2014 Proceedings Papers Committee
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4147929/
https://www.ncbi.nlm.nih.gov/pubmed/25161256
http://dx.doi.org/10.1093/bioinformatics/btu478