Combining word embeddings to extract chemical and drug entities in biomedical literature
BACKGROUND: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. METHODS: In this paper we evaluate two importa...
Main authors: | López-Úbeda, Pilar; Díaz-Galiano, Manuel Carlos; Ureña-López, L. Alfonso; Martín-Valdivia, M. Teresa |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8684055/ https://www.ncbi.nlm.nih.gov/pubmed/34920708 http://dx.doi.org/10.1186/s12859-021-04188-3 |
_version_ | 1784617538871623680 |
---|---|
author | López-Úbeda, Pilar; Díaz-Galiano, Manuel Carlos; Ureña-López, L. Alfonso; Martín-Valdivia, M. Teresa |
author_facet | López-Úbeda, Pilar; Díaz-Galiano, Manuel Carlos; Ureña-López, L. Alfonso; Martín-Valdivia, M. Teresa |
author_sort | López-Úbeda, Pilar |
collection | PubMed |
description | BACKGROUND: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving access to, and integration of, information from unstructured data such as biomedical literature. METHODS: In this paper we evaluate two important NLP tasks: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve on the results obtained in the PharmaCoNER challenge. RESULTS: For the NER task we present a neural network composed of a BiLSTM with a CRF sequential layer, in which different word embeddings are combined as input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. In the supervised model, we use the training set to look up concepts already seen in training; the unsupervised model is based on a 6-step architecture that uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. CONCLUSION: On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature: we achieved 91.41% precision, 90.14% recall, and 90.77% F1-score using micro-averaging. On the other hand, our indexing system achieves 92.91% precision, 92.44% recall, and a 92.67% F1-score. These results would place us first in the final ranking. |
format | Online Article Text |
id | pubmed-8684055 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8684055 2021-12-20 Combining word embeddings to extract chemical and drug entities in biomedical literature López-Úbeda, Pilar Díaz-Galiano, Manuel Carlos Ureña-López, L. Alfonso Martín-Valdivia, M. Teresa BMC Bioinformatics Research BioMed Central 2021-12-17 /pmc/articles/PMC8684055/ /pubmed/34920708 http://dx.doi.org/10.1186/s12859-021-04188-3 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research López-Úbeda, Pilar Díaz-Galiano, Manuel Carlos Ureña-López, L. Alfonso Martín-Valdivia, M. Teresa Combining word embeddings to extract chemical and drug entities in biomedical literature |
title | Combining word embeddings to extract chemical and drug entities in biomedical literature |
title_full | Combining word embeddings to extract chemical and drug entities in biomedical literature |
title_fullStr | Combining word embeddings to extract chemical and drug entities in biomedical literature |
title_full_unstemmed | Combining word embeddings to extract chemical and drug entities in biomedical literature |
title_short | Combining word embeddings to extract chemical and drug entities in biomedical literature |
title_sort | combining word embeddings to extract chemical and drug entities in biomedical literature |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8684055/ https://www.ncbi.nlm.nih.gov/pubmed/34920708 http://dx.doi.org/10.1186/s12859-021-04188-3 |
work_keys_str_mv | AT lopezubedapilar combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature AT diazgalianomanuelcarlos combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature AT urenalopezlalfonso combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature AT martinvaldiviamteresa combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature |
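
The description above mentions an indexing step that uses a dictionary of synonyms and the Levenshtein distance to assign a code to a recognized mention. The following is a minimal Python sketch of that general idea only, not the authors' 6-step architecture, whose details are not given in this record. The dictionary contents, the `CODE-…` placeholders (not real SNOMED-CT identifiers), the function names, and the `max_distance` threshold are all assumptions made for illustration.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical synonym dictionary: surface form -> placeholder code.
# These are NOT real SNOMED-CT codes.
SYNONYMS: Dict[str, str] = {
    "paracetamol": "CODE-001",
    "acetaminofen": "CODE-001",
    "ibuprofeno": "CODE-002",
    "glucosa": "CODE-003",
}


def assign_code(mention: str,
                synonyms: Dict[str, str] = SYNONYMS,
                max_distance: int = 2) -> Optional[str]:
    """Return the code of an exact or nearest dictionary entry, else None."""
    normalized = mention.strip().lower()
    if normalized in synonyms:  # exact dictionary hit
        return synonyms[normalized]
    # Otherwise fall back to the closest entry by edit distance.
    best_term = min(synonyms, key=lambda term: levenshtein(normalized, term))
    if levenshtein(normalized, best_term) <= max_distance:
        return synonyms[best_term]
    return None  # nothing close enough: leave the mention unindexed
```

For example, `assign_code("ibuprofen")` falls through the exact lookup and is matched to the entry `ibuprofeno`, which is one edit away. The threshold trades coverage against wrong assignments; the value used here is arbitrary.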