
nala: text mining natural language mutation mentions

MOTIVATION: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions in natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’...


Bibliographic Details
Main authors: Cejuela, Juan Miguel, Bojchevski, Aleksandar, Uhlig, Carsten, Bekmukhametov, Rustem, Kumar Karn, Sanjeev, Mahmuti, Shpend, Baghudana, Ashish, Dubey, Ankit, Satagopam, Venkata P, Rost, Burkhard
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870606/
https://www.ncbi.nlm.nih.gov/pubmed/28200120
http://dx.doi.org/10.1093/bioinformatics/btx083
author Cejuela, Juan Miguel
Bojchevski, Aleksandar
Uhlig, Carsten
Bekmukhametov, Rustem
Kumar Karn, Sanjeev
Mahmuti, Shpend
Baghudana, Ashish
Dubey, Ankit
Satagopam, Venkata P
Rost, Burkhard
collection PubMed
description MOTIVATION: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions in natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’). RESULTS: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions the corresponding value shot up to 100% nala-only. AVAILABILITY AND IMPLEMENTATION: Source code, API and corpora are freely available at: http://tagtog.net/-corpora/IDP4+. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-5870606
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5870606 2018-04-05 nala: text mining natural language mutation mentions Bioinformatics Original Papers Oxford University Press 2017-06-15 2017-02-13 /pmc/articles/PMC5870606/ /pubmed/28200120 http://dx.doi.org/10.1093/bioinformatics/btx083 Text en © The Author 2017. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title nala: text mining natural language mutation mentions
topic Original Papers