Protein function prediction by massive integration of evolutionary analyses and multiple data sources

BACKGROUND: Accurate protein function annotation is a severe bottleneck when utilizing the deluge of high-throughput, next generation sequencing data. Keeping database annotations up-to-date has become a major scientific challenge that requires the development of reliable automatic predictors of pro...

Bibliographic Details
Main Authors:	Cozzetto, Domenico; Buchan, Daniel WA; Bryson, Kevin; Jones, David T
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3584902/ https://www.ncbi.nlm.nih.gov/pubmed/23514099 http://dx.doi.org/10.1186/1471-2105-14-S3-S1

author	Cozzetto, Domenico; Buchan, Daniel WA; Bryson, Kevin; Jones, David T
author_sort	Cozzetto, Domenico
collection	PubMed
description	BACKGROUND: Accurate protein function annotation is a severe bottleneck when utilizing the deluge of high-throughput, next generation sequencing data. Keeping database annotations up-to-date has become a major scientific challenge that requires the development of reliable automatic predictors of protein function. The CAFA experiment provided a unique opportunity to undertake comprehensive 'blind testing' of many diverse approaches for automated function prediction. We report on the methodology we used for this challenge and on the lessons we learnt. METHODS: Our method integrates into a single framework a wide variety of biological information sources, encompassing sequence, gene expression and protein-protein interaction data, as well as annotations in UniProt entries. The methodology transfers functional categories based on the results from complementary homology-based and feature-based analyses. We generated the final molecular function and biological process assignments by combining the initial predictions in a probabilistic manner, which takes into account the Gene Ontology hierarchical structure. RESULTS: We propose a novel scoring function called COmbined Graph-Information Content similarity (COGIC) score for the comparison of predicted functional categories and benchmark data. We demonstrate that our integrative approach provides increased scope and accuracy over both the component methods and the naïve predictors. In line with previous studies, we find that molecular function predictions are more accurate than biological process assignments. CONCLUSIONS: Overall, the results indicate that there is considerable room for improvement in the field. It still remains for the community to invest a great deal of effort to make automated function prediction a useful and routine component in the toolbox of life scientists. As already witnessed in other areas, community-wide blind testing experiments will be pivotal in establishing standards for the evaluation of prediction accuracy, in fostering advancements and new ideas, and ultimately in recording progress.
format	Online Article Text
id	pubmed-3584902
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3584902 2013-03-11 BMC Bioinformatics Proceedings BioMed Central 2013-02-28 /pmc/articles/PMC3584902/ /pubmed/23514099 http://dx.doi.org/10.1186/1471-2105-14-S3-S1 Text en Copyright ©2013 Cozzetto et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Protein function prediction by massive integration of evolutionary analyses and multiple data sources
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3584902/ https://www.ncbi.nlm.nih.gov/pubmed/23514099 http://dx.doi.org/10.1186/1471-2105-14-S3-S1