Top-Down Clustering for Protein Subfamily Identification
We propose a novel method for the task of protein subfamily identification; that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree using as input a multiple alignment of the protein...
Main authors: | Costa, Eduardo P.; Vens, Celine; Blockeel, Hendrik |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Libertas Academica, 2013 |
Subjects: | Original Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3653887/ https://www.ncbi.nlm.nih.gov/pubmed/23700359 http://dx.doi.org/10.4137/EBO.S11609 |
_version_ | 1782269468709224448 |
---|---|
author | Costa, Eduardo P. Vens, Celine Blockeel, Hendrik |
author_facet | Costa, Eduardo P. Vens, Celine Blockeel, Hendrik |
author_sort | Costa, Eduardo P. |
collection | PubMed |
description | We propose a novel method for the task of protein subfamily identification; that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree from a multiple alignment of the protein sequences, then uses a post-pruning procedure to extract clusters from the tree. Unlike existing methods, it constructs the hierarchical tree top-down rather than bottom-up, and it associates particular mutations with each division into subclusters. The motivating hypothesis is that this may yield a better tree topology and, as a result, more accurate subfamily identification; in addition, the method indicates functionally important sites and allows for easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis. The novel method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow for classifying new sequences with an accuracy approaching that of hidden Markov models. |
format | Online Article Text |
id | pubmed-3653887 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Libertas Academica |
record_format | MEDLINE/PubMed |
spelling | pubmed-36538872013-05-22 Top-Down Clustering for Protein Subfamily Identification Costa, Eduardo P. Vens, Celine Blockeel, Hendrik Evol Bioinform Online Original Research We propose a novel method for the task of protein subfamily identification; that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree using as input a multiple alignment of the protein sequences, then uses a post-pruning procedure to extract clusters from the tree. Differently from existing methods, it constructs the hierarchical tree top-down, rather than bottom-up and associates particular mutations with each division into subclusters. The motivating hypothesis for this method is that it may yield a better tree topology with more accurate subfamily identification as a result and additionally indicates functionally important sites and allows for easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis. The novel method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow for classifying new sequences with an accuracy approaching that of hidden Markov models. Libertas Academica 2013-05-06 /pmc/articles/PMC3653887/ /pubmed/23700359 http://dx.doi.org/10.4137/EBO.S11609 Text en © 2013 the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article published under the Creative Commons CC-BY-NC 3.0 license. |
spellingShingle | Original Research Costa, Eduardo P. Vens, Celine Blockeel, Hendrik Top-Down Clustering for Protein Subfamily Identification |
title | Top-Down Clustering for Protein Subfamily Identification |
topic | Original Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3653887/ https://www.ncbi.nlm.nih.gov/pubmed/23700359 http://dx.doi.org/10.4137/EBO.S11609 |
work_keys_str_mv | AT costaeduardop topdownclusteringforproteinsubfamilyidentification AT vensceline topdownclusteringforproteinsubfamilyidentification AT blockeelhendrik topdownclusteringforproteinsubfamilyidentification |
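
The description above outlines a top-down tree in which each division is tied to a particular mutation. The following is a minimal sketch of that general idea only, not the authors' algorithm: the function names, the heterogeneity measure (count of residues differing from the column majority), the `min_gain` stopping threshold (used here in place of the post-pruning step the description mentions), and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter

def heterogeneity(seqs):
    """Count residues that differ from the majority residue of their column."""
    if not seqs:
        return 0
    return sum(len(seqs) - Counter(col).most_common(1)[0][1]
               for col in zip(*seqs))

def best_split(seqs):
    """Return the (position, residue) test giving the lowest summed heterogeneity."""
    best = None
    for pos in range(len(seqs[0])):
        for res in sorted({s[pos] for s in seqs}):
            yes = [s for s in seqs if s[pos] == res]
            no = [s for s in seqs if s[pos] != res]
            if not yes or not no:
                continue  # a test must separate the cluster into two parts
            cost = heterogeneity(yes) + heterogeneity(no)
            if best is None or cost < best[0]:
                best = (cost, pos, res, yes, no)
    return best

def build_tree(seqs, min_gain=1):
    """Recursively divide a non-empty list of equal-length aligned sequences."""
    split = best_split(seqs)
    if split is None or heterogeneity(seqs) - split[0] < min_gain:
        return {"leaf": sorted(seqs)}
    _, pos, res, yes, no = split
    return {"test": (pos, res),
            "yes": build_tree(yes, min_gain),
            "no": build_tree(no, min_gain)}

def classify(tree, seq):
    """Follow the residue tests from the root down to a leaf cluster."""
    while "leaf" not in tree:
        pos, res = tree["test"]
        tree = tree["yes"] if seq[pos] == res else tree["no"]
    return tree["leaf"]

# Hypothetical toy alignment: two groups that differ at columns 0 and 1.
toy = ["AKLG", "AKLG", "AKLC", "TRLG", "TRLG", "TRLC"]
tree = build_tree(toy, min_gain=2)
print(tree["test"])
print(classify(tree, "ARLG"))
```

Because every internal node stores a single (position, residue) test, a new aligned sequence can be assigned to a cluster by checking only the positions on its path from the root, which is the property the description refers to as easy classification of new proteins.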