EVEREST: automatic identification and classification of protein domains in all protein sequences
BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for...
Main authors: | Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533870/ https://www.ncbi.nlm.nih.gov/pubmed/16749920 http://dx.doi.org/10.1186/1471-2105-7-277 |
_version_ | 1782129077659893760 |
---|---|
author | Portugaly, Elon Harel, Amir Linial, Nathan Linial, Michal |
author_facet | Portugaly, Elon Harel, Amir Linial, Nathan Linial, Michal |
author_sort | Portugaly, Elon |
collection | PubMed |
description | BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site. |
format | Text |
id | pubmed-1533870 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1533870 2006-08-08 EVEREST: automatic identification and classification of protein domains in all protein sequences Portugaly, Elon Harel, Amir Linial, Nathan Linial, Michal BMC Bioinformatics Methodology Article BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site. BioMed Central 2006-06-02 /pmc/articles/PMC1533870/ /pubmed/16749920 http://dx.doi.org/10.1186/1471-2105-7-277 Text en Copyright © 2006 Portugaly et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Portugaly, Elon Harel, Amir Linial, Nathan Linial, Michal EVEREST: automatic identification and classification of protein domains in all protein sequences |
title | EVEREST: automatic identification and classification of protein domains in all protein sequences |
title_full | EVEREST: automatic identification and classification of protein domains in all protein sequences |
title_fullStr | EVEREST: automatic identification and classification of protein domains in all protein sequences |
title_full_unstemmed | EVEREST: automatic identification and classification of protein domains in all protein sequences |
title_short | EVEREST: automatic identification and classification of protein domains in all protein sequences |
title_sort | everest: automatic identification and classification of protein domains in all protein sequences |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533870/ https://www.ncbi.nlm.nih.gov/pubmed/16749920 http://dx.doi.org/10.1186/1471-2105-7-277 |
work_keys_str_mv | AT portugalyelon everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences AT harelamir everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences AT linialnathan everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences AT linialmichal everestautomaticidentificationandclassificationofproteindomainsinallproteinsequences |