A robust data-driven approach for gene ontology annotation

Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck in database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatical...

Bibliographic Details
Main authors:	Li, Yanpeng; Yu, Hong
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2014
Subjects:	Original Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243380/ https://www.ncbi.nlm.nih.gov/pubmed/25425037 http://dx.doi.org/10.1093/database/bau113

author	Li, Yanpeng; Yu, Hong
collection	PubMed
description	Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck in database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work on this task as well as the experimental results obtained after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using the reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained F1 scores of 22.1% and 35.7% by incorporating bigram features in RDE learning. On both the development and test sets, the RDE-based method achieved over 20% relative improvement in F1 and AUC over classical supervised learning methods, e.g. support vector machines and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method that retrieves the GO term most relevant to each evidence sentence using a ranking function that combines cosine similarity with the frequency of GO terms in documents, together with a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that incorporating frequency information and hierarchy filtering substantially improved performance. In the post-submission evaluation, we obtained an F1 of 10.6% using a simpler setting. Overall, the experimental analysis showed that our approaches were robust in both tasks.
format	Online Article Text
id	pubmed-4243380
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-4243380 2014-11-26 A robust data-driven approach for gene ontology annotation Li, Yanpeng Yu, Hong Database (Oxford) Original Article Oxford University Press 2014-11-23 /pmc/articles/PMC4243380/ /pubmed/25425037 http://dx.doi.org/10.1093/database/bau113 Text en © The Author(s) 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	A robust data-driven approach for gene ontology annotation
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243380/ https://www.ncbi.nlm.nih.gov/pubmed/25425037 http://dx.doi.org/10.1093/database/bau113
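
The GO term prediction step summarized in the description (a ranking function that combines cosine similarity with the frequency of GO terms in documents, plus filtering by high-level GO class) could be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the record does not give the actual formula, so the multiplicative log-frequency weight, the bag-of-words tokenizer, the function names, and the toy frequency counts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_go_terms(sentence, go_terms, term_freq, allowed_classes=None):
    """Rank GO terms for one evidence sentence, best first.

    go_terms maps GO id -> (term name, high-level GO class).
    term_freq maps GO id -> how often the term occurs in documents.
    allowed_classes, if given, keeps only terms in those high-level classes.
    """
    sent_vec = Counter(tokenize(sentence))
    scored = []
    for go_id, (name, top_class) in go_terms.items():
        # Hierarchy filtering: skip terms outside the allowed classes.
        if allowed_classes is not None and top_class not in allowed_classes:
            continue
        sim = cosine(sent_vec, Counter(tokenize(name)))
        # ASSUMPTION: frequency enters as a multiplicative log weight.
        weight = 1.0 + math.log(1.0 + term_freq.get(go_id, 0))
        scored.append((go_id, sim * weight))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy inputs for illustration; the frequency counts are invented.
GO_TERMS = {
    "GO:0005634": ("nucleus", "cellular_component"),
    "GO:0003677": ("DNA binding", "molecular_function"),
    "GO:0006915": ("apoptotic process", "biological_process"),
}
TERM_FREQ = {"GO:0005634": 50, "GO:0003677": 5, "GO:0006915": 20}
SENTENCE = "The protein localizes to the nucleus and shows DNA binding activity."

ranked = rank_go_terms(SENTENCE, GO_TERMS, TERM_FREQ)
for go_id, score in ranked:
    print(go_id, round(score, 3))
```

With these toy counts, the frequently used term "nucleus" ranks above "DNA binding" even though its cosine similarity to the sentence is lower; passing an empty frequency table reverses that order, and passing `allowed_classes={"molecular_function"}` leaves only the DNA binding term.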
