
EVEREST: automatic identification and classification of protein domains in all protein sequences

BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for...


Bibliographic Details
Main Authors:	Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533870/ https://www.ncbi.nlm.nih.gov/pubmed/16749920 http://dx.doi.org/10.1186/1471-2105-7-277

author	Portugaly, Elon Harel, Amir Linial, Nathan Linial, Michal
collection	PubMed
description	BACKGROUND: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.
format	Text
id	pubmed-1533870
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1533870 2006-08-08 EVEREST: automatic identification and classification of protein domains in all protein sequences Portugaly, Elon Harel, Amir Linial, Nathan Linial, Michal BMC Bioinformatics Methodology Article BioMed Central 2006-06-02 /pmc/articles/PMC1533870/ /pubmed/16749920 http://dx.doi.org/10.1186/1471-2105-7-277 Text en Copyright © 2006 Portugaly et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	EVEREST: automatic identification and classification of protein domains in all protein sequences
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1533870/ https://www.ncbi.nlm.nih.gov/pubmed/16749920 http://dx.doi.org/10.1186/1471-2105-7-277
