Deep learning with word embeddings improves biomedical named entity recognition

MOTIVATION: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface...

Bibliographic Details
Main authors: Habibi, Maryam; Weber, Leon; Neves, Mariana; Wiegandt, David Luis; Leser, Ulf
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870729/
https://www.ncbi.nlm.nih.gov/pubmed/28881963
http://dx.doi.org/10.1093/bioinformatics/btx228
author Habibi, Maryam
Weber, Leon
Neves, Mariana
Wiegandt, David Luis
Leser, Ulf
collection PubMed
description MOTIVATION: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/.
format Online
Article
Text
id pubmed-5870729
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5870729 2018-04-05 Deep learning with word embeddings improves biomedical named entity recognition Habibi, Maryam Weber, Leon Neves, Mariana Wiegandt, David Luis Leser, Ulf Bioinformatics Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
Oxford University Press 2017-07-15 2017-07-12 /pmc/articles/PMC5870729/ /pubmed/28881963 http://dx.doi.org/10.1093/bioinformatics/btx228 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Deep learning with word embeddings improves biomedical named entity recognition
topic Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
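
The description above reports results as F1-score and attributes the gain mostly to recall. As a reading aid, here is a minimal sketch of how precision, recall and F1 are commonly computed for NER. It assumes exact matching on entity boundaries and entity type; this record does not say which matching criterion the authors used, and the function name and the example spans are hypothetical.

```python
from typing import Set, Tuple

# An entity mention: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level scores under exact span-and-type matching (an assumption)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0, 0.0, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations, for illustration only (not data from the paper).
gold = {(0, 5, "Gene"), (10, 17, "Chemical"), (20, 28, "Disease"), (30, 34, "Gene")}
predicted = {(0, 5, "Gene"), (10, 17, "Chemical"), (40, 45, "Gene")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case two of the four gold entities are found, so recall is 0.5. Finding more of the missing gold entities raises recall, and with it F1, even when precision stays the same.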