Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum

BACKGROUND: Hidden Markov Models (HMMs) are a powerful tool for protein domain identification. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in newly sequenced organisms. In Pfam, each domain family is represented by a curated multip...

Bibliographic Details
Main authors: Terrapon, Nicolas; Gascuel, Olivier; Maréchal, Éric; Bréhélin, Laurent
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3434054/
https://www.ncbi.nlm.nih.gov/pubmed/22548871
http://dx.doi.org/10.1186/1471-2105-13-67
author Terrapon, Nicolas
Gascuel, Olivier
Maréchal, Éric
Bréhélin, Laurent
author_sort Terrapon, Nicolas
collection PubMed
description BACKGROUND: Hidden Markov Models (HMMs) are a powerful tool for protein domain identification. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in newly sequenced organisms. In Pfam, each domain family is represented by a curated multiple sequence alignment from which a profile HMM is built. In spite of their high specificity, HMMs may lack sensitivity when searching for domains in divergent organisms. This is particularly the case for species with a biased amino-acid composition, such as P. falciparum, the main causal agent of human malaria. In this context, fitting HMMs to the specificities of the target proteome can help identify additional domains. RESULTS: Using P. falciparum as an example, we compare approaches that have been proposed for this problem, and present two alternative methods. Because previous attempts rely strongly on known domain occurrences in the target species or its close relatives, they mainly improve the detection of domains which belong to already identified families. Our methods learn global correction rules that adjust amino-acid distributions associated with the match states of HMMs. These rules are applied to all match states of the whole HMM library, thus enabling the detection of domains from previously absent families. Additionally, we propose a procedure to estimate the proportion of false positives among the newly discovered domains. Starting with the Pfam standard library, we build several new libraries with the different HMM-fitting approaches. These libraries are first used to detect new domain occurrences with low E-values. Second, by applying the Co-Occurrence Domain Discovery (CODD) procedure we have recently proposed, the libraries are further used to identify likely occurrences among potential domains with higher E-values. CONCLUSION: We show that the new approaches allow identification of several domain families previously absent in the P. falciparum proteome and the Apicomplexa phylum, and identify many domains that are not detected by previous approaches. In terms of the number of newly discovered domains, the new approaches outperform the previous ones when no close species are available or when they are used to identify likely occurrences among potential domains with high E-values. All predictions on P. falciparum have been integrated into a dedicated website which pools all known/new annotations of protein domains and functions for this organism. Software implementing the two proposed approaches is available at the same address: http://www.lirmm.fr/~terrapon/HMMfit/
format Online
Article
Text
id pubmed-3434054
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3434054 2012-09-11 Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum Terrapon, Nicolas Gascuel, Olivier Maréchal, Éric Bréhélin, Laurent BMC Bioinformatics Methodology Article BioMed Central 2012-05-01 /pmc/articles/PMC3434054/ /pubmed/22548871 http://dx.doi.org/10.1186/1471-2105-13-67 Text en Copyright ©2012 Terrapon et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum
topic Methodology Article