
Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence

Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts for annotation fail. Here we address the fundamental quest...


Bibliographic Details
Main Authors: Bernardes, Juliana; Zaverucha, Gerson; Vaquero, Catherine; Carbone, Alessandra
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4966962/
https://www.ncbi.nlm.nih.gov/pubmed/27472895
http://dx.doi.org/10.1371/journal.pcbi.1005038
collection PubMed
description Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, annotation attempts fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: (1) for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences; (2) a decision-making protocol combines outcomes obtained from multiple models; (3) a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to a 30% improvement in the total number of Pfam domain predictions on the whole genome.
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
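The third step named in the abstract, choosing a protein architecture from candidate domain hits produced by several models, can be illustrated with a much simpler stand-in. The sketch below is not the authors' algorithm: it assumes each hit carries a single score and picks the non-overlapping set with the highest total score (classic weighted interval scheduling), whereas the abstract describes a multi-criteria optimization. The `Hit` class, the domain and model names, and all scores are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate domain hit on a protein (all fields hypothetical)."""
    domain: str   # domain family label, e.g. a Pfam-like identifier
    model: str    # which model produced the hit (clade-centered or consensus)
    start: int    # first residue covered, inclusive
    end: int      # last residue covered, inclusive
    score: float  # single score; a simplifying assumption of this sketch


def best_architecture(hits):
    """Return the non-overlapping subset of hits with the highest total score."""
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]
    # best[i] = (total score, chosen hits) considering only the first i hits
    best = [(0.0, [])]
    for i, h in enumerate(ordered):
        # j = number of earlier hits that end strictly before h starts
        j = bisect_right(ends, h.start - 1, 0, i)
        take = (best[j][0] + h.score, best[j][1] + [h])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


# Hypothetical hits from several models on one protein.
hits = [
    Hit("PF_A", "clade1", 1, 100, 50.0),
    Hit("PF_A", "consensus", 5, 90, 30.0),
    Hit("PF_B", "clade2", 80, 200, 40.0),
    Hit("PF_C", "clade3", 101, 180, 35.0),
]
architecture = best_architecture(hits)
print([(h.domain, h.model) for h in architecture])
```

In this toy input the stronger clade-centered hit for `PF_A` displaces the overlapping consensus hit, and `PF_C` is preferred over `PF_B` because it fits beside `PF_A` without overlap. Criteria such as domain co-occurrence, which the title mentions, are outside this sketch.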
id pubmed-4966962
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-4966962 2016-08-18 PLoS Comput Biol Research Article Public Library of Science 2016-07-29 Text en © 2016 Bernardes et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article