
Integrating information retrieval with distant supervision for Gene Ontology annotation

This article describes our participation in the Gene Ontology Curation task (GO task) of BioCreative IV, where we took part in both subtasks: (A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and (B) prediction of GO terms for relevant genes in full-text ar...


Bibliographic Details
Main Authors: Zhu, Dongqing, Li, Dingcheng, Carterette, Ben, Liu, Hongfang
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4150992/
https://www.ncbi.nlm.nih.gov/pubmed/25183856
http://dx.doi.org/10.1093/database/bau087
author Zhu, Dongqing
Li, Dingcheng
Carterette, Ben
Liu, Hongfang
collection PubMed
description This article describes our participation in the Gene Ontology Curation task (GO task) of BioCreative IV, where we took part in both subtasks: (A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and (B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data, supplemented with additional noisy negatives from an external resource. A greedy approach was then applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms from existing annotations for GOESs at different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best-performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best-performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system. Database URL: https://github.com/noname2020/Bioc
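The search-based idea in the description (predicting GO terms for a new sentence from the existing annotations of similar text) can be illustrated with a minimal sketch. This is not the authors' system: the abstract does not name the retrieval model, so plain TF-IDF cosine similarity is swapped in here as a stand-in, and the function names, the `k` and `top_n` parameters, and the three toy sentences are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Toy annotated collection (illustrative only, not data from the article):
# each entry pairs a sentence with the GO terms already assigned to it.
ANNOTATED = [
    ("the protein induces apoptosis in neuronal cells", ["GO:0006915"]),
    ("the transcription factor binds dna at the promoter", ["GO:0003677"]),
    ("the protein localizes to the nucleus", ["GO:0005634"]),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(annotated):
    """Return one TF-IDF vector per annotated sentence, plus the IDF table."""
    docs = [Counter(tokenize(sentence)) for sentence, _ in annotated]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_go_terms(query, annotated, k=2, top_n=1):
    """Retrieve the k most similar annotated sentences and let their GO
    terms vote, weighting each vote by the retrieval score."""
    vectors, idf = build_index(annotated)
    # Terms unseen in the collection get weight 0 and cannot match anything.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    sims = [cosine(q, v) for v in vectors]
    ranked = sorted(range(len(annotated)), key=lambda i: sims[i], reverse=True)[:k]
    scores = defaultdict(float)
    for i in ranked:
        for go_id in annotated[i][1]:
            scores[go_id] += sims[i]
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [go_id for go_id, score in ordered if score > 0.0][:top_n]


if __name__ == "__main__":
    print(predict_go_terms("mutant cells show reduced apoptosis", ANNOTATED))
```

The same voting scheme would apply unchanged if the indexed units were abstracts or full-text articles rather than sentences; only the contents of the annotated collection would differ.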
format Online
Article
Text
id pubmed-4150992
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4150992 2014-09-03 Integrating information retrieval with distant supervision for Gene Ontology annotation Zhu, Dongqing Li, Dingcheng Carterette, Ben Liu, Hongfang Database (Oxford) Original Article Oxford University Press 2014-09-01 /pmc/articles/PMC4150992/ /pubmed/25183856 http://dx.doi.org/10.1093/database/bau087 Text en © The Author(s) 2014. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Integrating information retrieval with distant supervision for Gene Ontology annotation
topic Original Article