A robust data-driven approach for gene ontology annotation

Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck in database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatical...


Bibliographic Details
Main authors: Li, Yanpeng, Yu, Hong
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243380/
https://www.ncbi.nlm.nih.gov/pubmed/25425037
http://dx.doi.org/10.1093/database/bau113
_version_ 1782346095910715392
author Li, Yanpeng
Yu, Hong
author_facet Li, Yanpeng
Yu, Hong
author_sort Li, Yanpeng
collection PubMed
description Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck in database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work on this task as well as the experimental results obtained after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using the reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained F1 scores of 22.1% and 35.7% by incorporating bigram features in RDE learning. On both the development and test sets, the RDE-based method achieved over 20% relative improvement in F1 and AUC over classical supervised learning methods, e.g. support vector machines and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed that our approaches were robust in both tasks.
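The GO term prediction step described in the abstract (ranking candidate GO terms for an evidence sentence by cosine similarity combined with term frequency) can be illustrated with a minimal sketch. The abstract does not give the actual ranking function, so everything here is an assumption made for illustration: the bag-of-words tokenizer, the multiplicative combination `cosine * (1 + alpha * log(1 + freq))`, the `alpha` parameter, and the function names. The hierarchy-based filtering step is not modeled.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase bag-of-words tokenizer (an illustrative choice)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_go_terms(sentence, go_terms, term_doc_freq, alpha=0.5):
    """Rank GO terms for one evidence sentence.

    go_terms: dict mapping GO id -> term name.
    term_doc_freq: dict mapping GO id -> how often the term occurs in
        the document (hypothetical input; missing ids count as 0).
    The scoring formula is an assumed combination, not the one used
    in the paper.
    """
    sentence_vec = Counter(tokenize(sentence))
    scored = []
    for go_id, name in go_terms.items():
        sim = cosine(sentence_vec, Counter(tokenize(name)))
        freq = term_doc_freq.get(go_id, 0)
        scored.append((sim * (1.0 + alpha * math.log1p(freq)), go_id))
    # Highest score first; ties broken by GO id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


# Illustrative inputs only; the sentence and frequencies are made up.
go_terms = {
    "GO:0006915": "apoptotic process",
    "GO:0008283": "cell population proliferation",
    "GO:0006412": "translation",
}
sentence = "Loss of the gene blocks the apoptotic process in neurons"
ranked = rank_go_terms(sentence, go_terms, {"GO:0006915": 3})
best_score, best_id = ranked[0]
```

In this sketch the frequency factor only rescales terms that already share words with the sentence, so a frequent but lexically unrelated GO term still scores zero. Other combinations (for example a weighted sum) would behave differently.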
format Online
Article
Text
id pubmed-4243380
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4243380 2014-11-26 A robust data-driven approach for gene ontology annotation Li, Yanpeng Yu, Hong Database (Oxford) Original Article Oxford University Press 2014-11-23 /pmc/articles/PMC4243380/ /pubmed/25425037 http://dx.doi.org/10.1093/database/bau113 Text en © The Author(s) 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Li, Yanpeng
Yu, Hong
A robust data-driven approach for gene ontology annotation
title A robust data-driven approach for gene ontology annotation
title_full A robust data-driven approach for gene ontology annotation
title_fullStr A robust data-driven approach for gene ontology annotation
title_full_unstemmed A robust data-driven approach for gene ontology annotation
title_short A robust data-driven approach for gene ontology annotation
title_sort robust data-driven approach for gene ontology annotation
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243380/
https://www.ncbi.nlm.nih.gov/pubmed/25425037
http://dx.doi.org/10.1093/database/bau113
work_keys_str_mv AT liyanpeng arobustdatadrivenapproachforgeneontologyannotation
AT yuhong arobustdatadrivenapproachforgeneontologyannotation
AT liyanpeng robustdatadrivenapproachforgeneontologyannotation
AT yuhong robustdatadrivenapproachforgeneontologyannotation