nala: text mining natural language mutation mentions
MOTIVATION: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions in natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’...
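Main Authors: | Cejuela, Juan Miguel; Bojchevski, Aleksandar; Uhlig, Carsten; Bekmukhametov, Rustem; Kumar Karn, Sanjeev; Mahmuti, Shpend; Baghudana, Ashish; Dubey, Ankit; Satagopam, Venkata P; Rost, Burkhard |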
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870606/ https://www.ncbi.nlm.nih.gov/pubmed/28200120 http://dx.doi.org/10.1093/bioinformatics/btx083 |
_version_ | 1783309519607037952 |
---|---|
author | Cejuela, Juan Miguel; Bojchevski, Aleksandar; Uhlig, Carsten; Bekmukhametov, Rustem; Kumar Karn, Sanjeev; Mahmuti, Shpend; Baghudana, Ashish; Dubey, Ankit; Satagopam, Venkata P; Rost, Burkhard |
author_facet | Cejuela, Juan Miguel; Bojchevski, Aleksandar; Uhlig, Carsten; Bekmukhametov, Rustem; Kumar Karn, Sanjeev; Mahmuti, Shpend; Baghudana, Ashish; Dubey, Ankit; Satagopam, Venkata P; Rost, Burkhard |
author_sort | Cejuela, Juan Miguel |
collection | PubMed |
description | MOTIVATION: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions in natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’). RESULTS: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions the corresponding value shot up to 100% nala-only. AVAILABILITY AND IMPLEMENTATION: Source code, API and corpora freely available at: http://tagtog.net/-corpora/IDP4+. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-5870606 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5870606 2018-04-05 nala: text mining natural language mutation mentions Cejuela, Juan Miguel Bojchevski, Aleksandar Uhlig, Carsten Bekmukhametov, Rustem Kumar Karn, Sanjeev Mahmuti, Shpend Baghudana, Ashish Dubey, Ankit Satagopam, Venkata P Rost, Burkhard Bioinformatics Original Papers MOTIVATION: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions in natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’). RESULTS: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions the corresponding value shot up to 100% nala-only. AVAILABILITY AND IMPLEMENTATION: Source code, API and corpora freely available at: http://tagtog.net/-corpora/IDP4+. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-06-15 2017-02-13 /pmc/articles/PMC5870606/ /pubmed/28200120 http://dx.doi.org/10.1093/bioinformatics/btx083 Text en © The Author 2017. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Cejuela, Juan Miguel Bojchevski, Aleksandar Uhlig, Carsten Bekmukhametov, Rustem Kumar Karn, Sanjeev Mahmuti, Shpend Baghudana, Ashish Dubey, Ankit Satagopam, Venkata P Rost, Burkhard nala: text mining natural language mutation mentions |
title | nala: text mining natural language mutation mentions |
title_full | nala: text mining natural language mutation mentions |
title_fullStr | nala: text mining natural language mutation mentions |
title_full_unstemmed | nala: text mining natural language mutation mentions |
title_short | nala: text mining natural language mutation mentions |
title_sort | nala: text mining natural language mutation mentions |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870606/ https://www.ncbi.nlm.nih.gov/pubmed/28200120 http://dx.doi.org/10.1093/bioinformatics/btx083 |
work_keys_str_mv | AT cejuelajuanmiguel nalatextminingnaturallanguagemutationmentions AT bojchevskialeksandar nalatextminingnaturallanguagemutationmentions AT uhligcarsten nalatextminingnaturallanguagemutationmentions AT bekmukhametovrustem nalatextminingnaturallanguagemutationmentions AT kumarkarnsanjeev nalatextminingnaturallanguagemutationmentions AT mahmutishpend nalatextminingnaturallanguagemutationmentions AT baghudanaashish nalatextminingnaturallanguagemutationmentions AT dubeyankit nalatextminingnaturallanguagemutationmentions AT satagopamvenkatap nalatextminingnaturallanguagemutationmentions AT rostburkhard nalatextminingnaturallanguagemutationmentions |
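
The abstract in this record distinguishes standard (ST) mentions such as ‘E6V’ from natural language (NL) mentions such as ‘glutamic acid was substituted by valine at residue 6’. The following is a minimal sketch of why pattern-based matching tends to find the former and miss the latter. It is only an illustration: the pattern `ST_SUBSTITUTION` and the helper `find_st_mentions` are hypothetical names introduced here, and the code is not the method used by nala, SETH, or tmVar.

```python
import re

# Assumption for illustration: an ST protein substitution is written as
# one-letter wild-type residue, 1-based position, one-letter mutant residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ST_SUBSTITUTION = re.compile(
    rf"\b([{AMINO_ACIDS}])([1-9]\d*)([{AMINO_ACIDS}])\b"
)


def find_st_mentions(text):
    """Return (wild_type, position, mutant) tuples for ST-style mentions."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in ST_SUBSTITUTION.finditer(text)
    ]


# The ST example from the abstract is matched by the pattern.
st_sentence = "The E6V variant was observed."
# The NL example from the abstract describes the same change in words,
# so a surface pattern like the one above finds nothing.
nl_sentence = "glutamic acid was substituted by valine at residue 6"

print(find_st_mentions(st_sentence))
print(find_st_mentions(nl_sentence))
```

The first call returns one tuple for ‘E6V’, while the second returns an empty list, which is the gap that the abstract says NL-aware tagging is meant to close.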