EVEREST: automatic identification and classification of protein domains in all protein sequences

BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for...

Bibliographic Details
Main Authors: Portugaly, Elon, Harel, Amir, Linial, Nathan, Linial, Michal
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533870/
https://www.ncbi.nlm.nih.gov/pubmed/16749920
http://dx.doi.org/10.1186/1471-2105-7-277
_version_ 1782129077659893760
author Portugaly, Elon
Harel, Amir
Linial, Nathan
Linial, Michal
author_facet Portugaly, Elon
Harel, Amir
Linial, Nathan
Linial, Michal
author_sort Portugaly, Elon
collection PubMed
description BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains.
CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
format Text
id pubmed-1533870
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-15338702006-08-08 EVEREST: automatic identification and classification of protein domains in all protein sequences Portugaly, Elon Harel, Amir Linial, Nathan Linial, Michal BMC Bioinformatics Methodology Article BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity.
EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site. BioMed Central 2006-06-02 /pmc/articles/PMC1533870/ /pubmed/16749920 http://dx.doi.org/10.1186/1471-2105-7-277 Text en Copyright © 2006 Portugaly et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Portugaly, Elon
Harel, Amir
Linial, Nathan
Linial, Michal
EVEREST: automatic identification and classification of protein domains in all protein sequences
title EVEREST: automatic identification and classification of protein domains in all protein sequences
title_full EVEREST: automatic identification and classification of protein domains in all protein sequences
title_fullStr EVEREST: automatic identification and classification of protein domains in all protein sequences
title_full_unstemmed EVEREST: automatic identification and classification of protein domains in all protein sequences
title_short EVEREST: automatic identification and classification of protein domains in all protein sequences
title_sort everest: automatic identification and classification of protein domains in all protein sequences
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533870/
https://www.ncbi.nlm.nih.gov/pubmed/16749920
http://dx.doi.org/10.1186/1471-2105-7-277
work_keys_str_mv AT portugalyelon everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences
AT harelamir everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences
AT linialnathan everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences
AT linialmichal everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences