Deep learning with word embeddings improves biomedical named entity recognition
Main authors: | Habibi, Maryam; Weber, Leon; Neves, Mariana; Wiegandt, David Luis; Leser, Ulf |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2017 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870729/ https://www.ncbi.nlm.nih.gov/pubmed/28881963 http://dx.doi.org/10.1093/bioinformatics/btx228 |
_version_ | 1783309541932269568 |
---|---|
author | Habibi, Maryam Weber, Leon Neves, Mariana Wiegandt, David Luis Leser, Ulf |
author_sort | Habibi, Maryam |
collection | PubMed |
description | MOTIVATION: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/. |
format | Online Article Text |
id | pubmed-5870729 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5870729 2018-04-05 Deep learning with word embeddings improves biomedical named entity recognition Habibi, Maryam Weber, Leon Neves, Mariana Wiegandt, David Luis Leser, Ulf Bioinformatics Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
Oxford University Press 2017-07-15 2017-07-12 /pmc/articles/PMC5870729/ /pubmed/28881963 http://dx.doi.org/10.1093/bioinformatics/btx228 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
title | Deep learning with word embeddings improves biomedical named entity recognition |
topic | Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017 |