SPRENO: a BioC module for identifying organism terms in figure captions

Recent advances in biological research reveal that the majority of the experiments strive for comprehensive exploration of the biological system rather than targeting specific biological entities. The qualitative and quantitative findings of the investigations are often exclusively available in the...

Bibliographic Details
Main authors: Dai, Hong-Jie; Singh, Onkar
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6007219/
https://www.ncbi.nlm.nih.gov/pubmed/29873706
http://dx.doi.org/10.1093/database/bay048
author Dai, Hong-Jie
Singh, Onkar
collection PubMed
description Recent advances in biological research reveal that most experiments aim for a comprehensive exploration of a biological system rather than targeting specific biological entities. The qualitative and quantitative findings of these investigations are often available only in the form of figures in published papers. Such findings have been instrumental in building a deeper understanding of biological processes and pathways. However, these data are not readily machine-readable, because figure captions pack in a great deal of information in an ambiguous manner. The abbreviated term ‘SIN’ exemplifies this issue, as it may stand for Sindbis virus or the sex-lethal interactor gene (Drosophila melanogaster). To overcome this ambiguity, entities should be identified by linking them to their respective entries in established biological databases. Among all entity types, species identification plays a pivotal role in disambiguating related entities in the text. In this study, we present our species identification tool SPRENO (Species Recognition and Normalization), which recognizes organism terms mentioned in figure captions and links them to the NCBI taxonomy database by exploiting contextual information from both the figure caption and the corresponding full text. To determine the ID of ambiguous organism mentions, two disambiguation methods have been developed. One applies the majority rule, selecting the ID that has been successfully linked to previously mentioned organism terms. The other is a convolutional neural network (CNN) model trained on both the context and the distance information of the target organism mention. As a system based on the majority rule, SPRENO was one of the top-ranked systems in the BioCreative VI BioID track and achieved micro F-scores of 0.776 (entity recognition) and 0.755 (entity normalization) on the official test set. Additionally, SPRENO-CNN exhibited higher precision but lower recall and F-scores (0.720/0.711 for entity recognition/normalization). SPRENO is freely available at https://bigodatamining.github.io/software/201801/. Database URL: https://bigodatamining.github.io/software/201801/
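The majority-rule disambiguation mentioned in the description can be illustrated with a minimal Python sketch. This is not SPRENO's actual implementation: the function name `resolve_by_majority`, its parameters, and the tie-breaking behavior (earliest-seen ID wins) are assumptions made for illustration only. The IDs in the usage lines are NCBI taxonomy IDs (9606 Homo sapiens, 10090 Mus musculus, 10116 Rattus norvegicus) applied to a hypothetical ambiguous mention.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def resolve_by_majority(
    candidate_ids: Set[str],
    previously_linked_ids: Iterable[str],
) -> Optional[str]:
    """Hypothetical helper: choose a taxonomy ID for an ambiguous mention.

    candidate_ids: IDs the ambiguous mention could refer to.
    previously_linked_ids: IDs already assigned to earlier organism
        mentions in the same document, in order of appearance.
    """
    # Count only earlier links that are also candidates for this mention.
    counts = Counter(
        tax_id for tax_id in previously_linked_ids if tax_id in candidate_ids
    )
    if not counts:
        # No earlier evidence; the caller needs some other fallback.
        return None
    # Assumption: ties go to the ID seen first, which is how
    # Counter.most_common orders elements with equal counts.
    return counts.most_common(1)[0][0]


# Usage: a mention that could be mouse (10090) or rat (10116).
earlier_links = ["9606", "10090", "10090", "10116"]
chosen = resolve_by_majority({"10090", "10116"}, earlier_links)
```

Here `chosen` is the mouse ID because it was linked twice earlier in the document, against once for the rat ID. When no candidate has been linked before, the sketch returns `None` rather than guessing.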
format Online
Article
Text
id pubmed-6007219
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6007219 2018-06-25 SPRENO: a BioC module for identifying organism terms in figure captions Dai, Hong-Jie Singh, Onkar Database (Oxford) Original Article Oxford University Press 2018-06-03 /pmc/articles/PMC6007219/ /pubmed/29873706 http://dx.doi.org/10.1093/database/bay048 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title SPRENO: a BioC module for identifying organism terms in figure captions
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6007219/
https://www.ncbi.nlm.nih.gov/pubmed/29873706
http://dx.doi.org/10.1093/database/bay048