Top-Down Clustering for Protein Subfamily Identification

We propose a novel method for the task of protein subfamily identification; that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree using as input a multiple alignment of the protein...

Bibliographic Details
Main authors: Costa, Eduardo P., Vens, Celine, Blockeel, Hendrik
Format: Online Article Text
Language: English
Published: Libertas Academica 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3653887/
https://www.ncbi.nlm.nih.gov/pubmed/23700359
http://dx.doi.org/10.4137/EBO.S11609
author Costa, Eduardo P.
Vens, Celine
Blockeel, Hendrik
collection PubMed
description We propose a novel method for protein subfamily identification, that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree from a multiple alignment of the protein sequences, then uses a post-pruning procedure to extract clusters from the tree. Unlike existing methods, it constructs the hierarchical tree top-down rather than bottom-up, and associates particular mutations with each division into subclusters. The motivating hypothesis is that this may yield a better tree topology, and therefore more accurate subfamily identification, while also indicating functionally important sites and allowing easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis: the method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow new sequences to be classified with an accuracy approaching that of hidden Markov models.
format Online
Article
Text
id pubmed-3653887
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Libertas Academica
record_format MEDLINE/PubMed
spelling pubmed-3653887 2013-05-22 Top-Down Clustering for Protein Subfamily Identification Costa, Eduardo P. Vens, Celine Blockeel, Hendrik Evol Bioinform Online Original Research Libertas Academica 2013-05-06 /pmc/articles/PMC3653887/ /pubmed/23700359 http://dx.doi.org/10.4137/EBO.S11609 Text en © 2013 the author(s), publisher and licensee Libertas Academica Ltd. This is an open access article published under the Creative Commons CC-BY-NC 3.0 license.
title Top-Down Clustering for Protein Subfamily Identification
topic Original Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3653887/
https://www.ncbi.nlm.nih.gov/pubmed/23700359
http://dx.doi.org/10.4137/EBO.S11609
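
The description in this record outlines a top-down tree in which each division is tied to a particular mutation, and notes that those mutations alone can classify new sequences. The Python sketch below illustrates that general idea only; it is not the authors' implementation. The names `hamming`, `within_cost`, `best_split`, `build_tree`, and `classify`, the Hamming-distance cost, the toy alignment, and the `min_size` stopping rule (used here in place of the post-pruning step, which is not sketched) are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of alignment columns in which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def within_cost(seqs):
    """Sum of pairwise Hamming distances inside one group (assumed cost)."""
    return sum(hamming(a, b) for a, b in combinations(seqs, 2))


def best_split(seqs, min_size):
    """Pick the test 'column pos holds residue res' with the lowest total cost."""
    best = None
    for pos in range(len(seqs[0])):
        for res in sorted({s[pos] for s in seqs}):
            yes = [s for s in seqs if s[pos] == res]
            no = [s for s in seqs if s[pos] != res]
            if len(yes) < min_size or len(no) < min_size:
                continue  # both subclusters must keep at least min_size members
            cost = within_cost(yes) + within_cost(no)
            if best is None or cost < best[0]:
                best = (cost, pos, res, yes, no)
    return best


def build_tree(seqs, min_size=2):
    """Divide the alignment top-down; min_size is a stand-in for post-pruning."""
    split = best_split(seqs, min_size)
    if split is None:
        return {"leaf": seqs}
    _, pos, res, yes, no = split
    return {
        "test": (pos, res),  # the mutation associated with this division
        "yes": build_tree(yes, min_size),
        "no": build_tree(no, min_size),
    }


def classify(tree, seq):
    """Route a new aligned sequence down the tree using only the split tests."""
    while "leaf" not in tree:
        pos, res = tree["test"]
        tree = tree["yes"] if seq[pos] == res else tree["no"]
    return tree["leaf"]


# Hypothetical toy alignment: four aligned sequences of equal length.
alignment = ["AKLV", "AKLI", "GRLV", "GRLI"]
tree = build_tree(alignment)
print(tree["test"])
print(classify(tree, "AKMV"))
```

Because every internal node stores a single column and residue test, assigning a new sequence to a cluster requires only a few residue lookups, which is the property the description highlights when it compares classification accuracy with hidden Markov models.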