
Improving protein function prediction methods with integrated literature data

BACKGROUND: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target,...


Bibliographic Details
Main authors:	Gabow, Aaron P; Leach, Sonia M; Baumgartner, William A; Hunter, Lawrence E; Goldberg, Debra S
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375131/ https://www.ncbi.nlm.nih.gov/pubmed/18412966 http://dx.doi.org/10.1186/1471-2105-9-198

author	Gabow, Aaron P; Leach, Sonia M; Baumgartner, William A; Hunter, Lawrence E; Goldberg, Debra S
author_sort	Gabow, Aaron P
collection	PubMed
description	BACKGROUND: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence, and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. RESULTS: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes that global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial. CONCLUSION: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.
format	Text
id	pubmed-2375131
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2375131 2008-05-12 Improving protein function prediction methods with integrated literature data Gabow, Aaron P; Leach, Sonia M; Baumgartner, William A; Hunter, Lawrence E; Goldberg, Debra S BMC Bioinformatics Research Article BioMed Central 2008-04-15 /pmc/articles/PMC2375131/ /pubmed/18412966 http://dx.doi.org/10.1186/1471-2105-9-198 Text en Copyright © 2008 Gabow et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Improving protein function prediction methods with integrated literature data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2375131/ https://www.ncbi.nlm.nih.gov/pubmed/18412966 http://dx.doi.org/10.1186/1471-2105-9-198

Improving protein function prediction methods with integrated literature data
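
The abstract describes a "traditional" baseline in which two proteins are linked whenever at least one abstract mentions both. A minimal Python sketch of that baseline follows. The function names, the protein identifiers, and the `min_abstracts` cutoff are all hypothetical, introduced only for illustration. In particular, the raw count cutoff is a simple stand-in for filtering and is not the reliability measure or the 10% threshold reported by the authors, which this record does not specify.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_mentions):
    """Count, for each unordered protein pair, how many abstracts mention both.

    abstract_mentions: iterable of lists, one list of protein IDs per abstract.
    """
    counts = Counter()
    for proteins in abstract_mentions:
        # Deduplicate within an abstract and sort so each pair has one canonical order.
        for a, b in combinations(sorted(set(proteins)), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_edges(counts, min_abstracts=1):
    """Keep pairs seen in at least `min_abstracts` abstracts.

    min_abstracts=1 reproduces the baseline from the abstract; larger values
    are a hypothetical stand-in for a stricter filter (an assumption).
    """
    return {pair for pair, n in counts.items() if n >= min_abstracts}


# Toy input with made-up protein identifiers.
abstracts = [
    ["YFG1", "YFG2"],
    ["YFG1", "YFG2", "YFG3"],
    ["YFG3", "YFG4"],
]

counts = cooccurrence_counts(abstracts)
baseline = cooccurrence_edges(counts)                    # at least one shared abstract
stricter = cooccurrence_edges(counts, min_abstracts=2)   # hypothetical stricter cutoff
```

On this toy input the baseline links four pairs, while the stricter cutoff keeps only the pair mentioned together twice. This illustrates how a single shared abstract can add edges that a more demanding criterion would drop.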
