
Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets

BACKGROUND: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have invest...


Bibliographic Details
Main Authors:	Nakano, Felipe Kenji; Lietaert, Mathias; Vens, Celine
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6755698/ https://www.ncbi.nlm.nih.gov/pubmed/31547800 http://dx.doi.org/10.1186/s12859-019-3060-6

author	Nakano, Felipe Kenji; Lietaert, Mathias; Vens, Celine
author_facet	Nakano, Felipe Kenji; Lietaert, Mathias; Vens, Celine
author_sort	Nakano, Felipe Kenji
collection	PubMed
description	BACKGROUND: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus trained their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we evaluate whether the predictive models are able to discover new or wrong annotations by training them on the old data and evaluating their results against the most recent information. RESULTS: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA) was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, while HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. CONCLUSIONS: The experiments have shown that protein function prediction is a very challenging task that should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.
format	Online Article Text
id	pubmed-6755698
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6755698 2019-09-26 Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets Nakano, Felipe Kenji; Lietaert, Mathias; Vens, Celine BMC Bioinformatics Research Article BioMed Central 2019-09-23 /pmc/articles/PMC6755698/ /pubmed/31547800 http://dx.doi.org/10.1186/s12859-019-3060-6 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Machine learning for discovering missing or wrong protein function annotations: A comparison using updated benchmark datasets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6755698/ https://www.ncbi.nlm.nih.gov/pubmed/31547800 http://dx.doi.org/10.1186/s12859-019-3060-6


Similar Items