
Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds

BACKGROUND: As tertiary structure is currently available only for a fraction of known protein families, it is important to assess what parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with so...


Bibliographic Details
Main authors: Sadreyev, Ruslan I, Grishin, Nick V
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1444916/
https://www.ncbi.nlm.nih.gov/pubmed/16549009
http://dx.doi.org/10.1186/1472-6807-6-6
author Sadreyev, Ruslan I
Grishin, Nick V
collection PubMed
description BACKGROUND: As tertiary structure is currently available for only a fraction of known protein families, it is important to assess which parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structures, and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomic initiatives (SGI) provide such a sample? What are the approximate total numbers of structure-based superfamilies and folds among soluble globular domains?
RESULTS: To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring the dynamics of the assigned structure set over time, as experimentally solved structures accumulate. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias suggesting that target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To account for the observed bias, we propose a new non-parametric approach to estimating the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on the dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database.
CONCLUSION: The set of currently solved protein structures allows for structure prediction in approximately a third of sequence-based domain families. The choice of targets for structure determination is biased towards domains with many sequence-based homologs. Growing SGI output should further reduce this bias in the future. The total numbers of structural superfamilies and folds in the COG database are estimated at ~4000 and ~1700, respectively. These numbers are, respectively, four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.
format Text
id pubmed-1444916
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1444916 2006-04-24 Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds Sadreyev, Ruslan I Grishin, Nick V BMC Struct Biol Research Article BioMed Central 2006-03-20 /pmc/articles/PMC1444916/ /pubmed/16549009 http://dx.doi.org/10.1186/1472-6807-6-6 Text en Copyright © 2006 Sadreyev and Grishin; licensee BioMed Central Ltd.
title Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds
topic Research Article