
A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions

BACKGROUND: Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, current approaches for predicting function must be improved upon. One problem in particular is overly-specific function pre...

Bibliographic Details
Main Authors:	Louie, Brenton; Higdon, Roger; Kolker, Eugene
Format:	Text
Language:	English
Published:	Public Library of Science 2009
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2760442/ https://www.ncbi.nlm.nih.gov/pubmed/19844580 http://dx.doi.org/10.1371/journal.pone.0007546

author	Louie, Brenton; Higdon, Roger; Kolker, Eugene
collection	PubMed
description	BACKGROUND: Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, but current approaches for predicting function must also be improved. One problem in particular is overly-specific function predictions, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity. METHODOLOGY: Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity, measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity. SIGNIFICANCE: Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value <1e−62, non-redundant NCBI protein database (NRDB)) and only a small likelihood of a specific function match for proteins with low sequence similarity (bit score <54.6, e-value >1e−05, NRDB). For sequence similarity ranges in between, our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function and predicted functions for thousands of these proteins, ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) compared to predictions from our model, which is based on proteins with experimentally determined function.
format	Text
id	pubmed-2760442
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-2760442 2009-10-21 PLoS One Research Article Public Library of Science 2009-10-21 /pmc/articles/PMC2760442/ /pubmed/19844580 http://dx.doi.org/10.1371/journal.pone.0007546 Text en Louie et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions
topic	Research Article
