
The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches

BACKGROUND: Functional annotation of novel proteins is one of the central problems in bioinformatics. With the ever-increasing development of genome sequencing technologies, more and more sequence information is becoming available to analyze and annotate. To achieve fast and automatic function annot...


Bibliographic Details
Main authors:	Khan, Ishita K., Wei, Qing, Chapman, Samuel, KC, Dukka B., Kihara, Daisuke
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4570625/ https://www.ncbi.nlm.nih.gov/pubmed/26380077 http://dx.doi.org/10.1186/s13742-015-0083-4

author	Khan, Ishita K. Wei, Qing Chapman, Samuel KC, Dukka B. Kihara, Daisuke
author_sort	Khan, Ishita K.
collection	PubMed
description	BACKGROUND: Functional annotation of novel proteins is one of the central problems in bioinformatics. With the ever-increasing development of genome sequencing technologies, more and more sequence information is becoming available to analyze and annotate. To achieve fast and automatic function annotation, many computational (automated) function prediction (AFP) methods have been developed. To objectively evaluate the performance of such methods on a large scale, community-wide assessment experiments have been conducted. The second round of the Critical Assessment of Function Annotation (CAFA) experiment was held in 2013–2014. Evaluation of participating groups was reported in a special interest group meeting at the Intelligent Systems in Molecular Biology (ISMB) conference in Boston in 2014. Our group participated in both CAFA1 and CAFA2 using multiple, in-house AFP methods. Here, we report benchmark results of our methods obtained in the course of preparation for CAFA2 prior to submitting function predictions for CAFA2 targets. RESULTS: For CAFA2, we updated the annotation databases used by our methods, protein function prediction (PFP) and extended similarity group (ESG), and benchmarked their function prediction performances using the original (older) and updated databases. Performance evaluation for PFP with different settings and ESG are discussed. We also developed two ensemble methods that combine function predictions from six independent, sequence-based AFP methods. We further analyzed the performances of our prediction methods by enriching the predictions with prior distribution of gene ontology (GO) terms. Examples of predictions by the ensemble methods are discussed. CONCLUSIONS: Updating the annotation database was successful, improving the F(max) prediction accuracy score for both PFP and ESG. Adding the prior distribution of GO terms did not make much improvement. Both of the ensemble methods we developed improved the average F(max) score over all individual component methods except for ESG. Our benchmark results will not only complement the overall assessment that will be done by the CAFA organizers, but also help elucidate the predictive powers of sequence-based function prediction methods in general. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13742-015-0083-4) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4570625
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4570625 2015-09-16 The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches Khan, Ishita K. Wei, Qing Chapman, Samuel KC, Dukka B. Kihara, Daisuke Gigascience Research BioMed Central 2015-09-14 /pmc/articles/PMC4570625/ /pubmed/26380077 http://dx.doi.org/10.1186/s13742-015-0083-4 Text en © Khan et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	The PFP and ESG protein function prediction methods in 2014: effect of database updates and ensemble approaches
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4570625/ https://www.ncbi.nlm.nih.gov/pubmed/26380077 http://dx.doi.org/10.1186/s13742-015-0083-4
