SPRENO: a BioC module for identifying organism terms in figure captions
Recent advances in biological research reveal that the majority of the experiments strive for comprehensive exploration of the biological system rather than targeting specific biological entities. The qualitative and quantitative findings of the investigations are often exclusively available in the...
Main Authors: | Dai, Hong-Jie; Singh, Onkar |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2018 |
Subjects: | Original Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6007219/ https://www.ncbi.nlm.nih.gov/pubmed/29873706 http://dx.doi.org/10.1093/database/bay048 |
_version_ | 1783332995041591296 |
---|---|
author | Dai, Hong-Jie; Singh, Onkar |
author_facet | Dai, Hong-Jie; Singh, Onkar |
author_sort | Dai, Hong-Jie |
collection | PubMed |
description | Recent advances in biological research reveal that the majority of the experiments strive for comprehensive exploration of the biological system rather than targeting specific biological entities. The qualitative and quantitative findings of the investigations are often exclusively available in the form of figures in published papers. There is no denying that such findings have been instrumental in the in-depth understanding of biological processes and pathways. However, such data are not readily interpretable by machines, because the descriptions in figure captions convey rich information in an ambiguous manner. The abbreviated term ‘SIN’ exemplifies this issue, as it may stand for Sindbis virus or the sex-lethal interactor gene (Drosophila melanogaster). To overcome this ambiguity, entities should be identified by linking them to the respective entries in established biological databases. Among all entity types, identifying species plays a pivotal role in disambiguating related entities in the text. In this study, we present our species identification tool SPRENO (Species Recognition and Normalization), which recognizes organism terms mentioned in figure captions and links them to the NCBI taxonomy database by exploiting contextual information from both the figure caption and the corresponding full text. To determine the ID of ambiguous organism mentions, two disambiguation methods have been developed. One is based on the majority rule, selecting the ID that has been successfully linked to previously mentioned organism terms. The other is a convolutional neural network (CNN) model trained on both the context and the distance information of the target organism mention. As a system based on the majority rule, SPRENO was one of the top-ranked systems in the BioCreative VI BioID track and achieved micro F-scores of 0.776 (entity recognition) and 0.755 (entity normalization) on the official test set. Additionally, SPRENO-CNN exhibited higher precision but lower recall and F-scores (0.720/0.711 for entity recognition/normalization). SPRENO is freely available at https://bigodatamining.github.io/software/201801/. Database URL: https://bigodatamining.github.io/software/201801/ |
format | Online Article Text |
id | pubmed-6007219 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-6007219 2018-06-25 SPRENO: a BioC module for identifying organism terms in figure captions Dai, Hong-Jie; Singh, Onkar Database (Oxford) Original Article Oxford University Press 2018-06-03 /pmc/articles/PMC6007219/ /pubmed/29873706 http://dx.doi.org/10.1093/database/bay048 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | SPRENO: a BioC module for identifying organism terms in figure captions |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6007219/ https://www.ncbi.nlm.nih.gov/pubmed/29873706 http://dx.doi.org/10.1093/database/bay048 |