
Combining word embeddings to extract chemical and drug entities in biomedical literature

BACKGROUND: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. METHODS: In this paper we evaluate two importa...

Bibliographic Details
Main authors: López-Úbeda, Pilar, Díaz-Galiano, Manuel Carlos, Ureña-López, L. Alfonso, Martín-Valdivia, M. Teresa
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8684055/
https://www.ncbi.nlm.nih.gov/pubmed/34920708
http://dx.doi.org/10.1186/s12859-021-04188-3
author López-Úbeda, Pilar
Díaz-Galiano, Manuel Carlos
Ureña-López, L. Alfonso
Martín-Valdivia, M. Teresa
collection PubMed
description BACKGROUND: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. METHODS: In this paper we evaluate two important NLP tasks: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve the results obtained in the PharmaCoNER challenge. RESULTS: For the NER task we present a neural network composed of a BiLSTM with a CRF sequential layer, where different word embeddings are combined as input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. In the supervised model, we use the training set to find previously trained concepts, and the unsupervised model is based on a 6-step architecture. This architecture uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. CONCLUSION: On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature. We achieved 91.41% precision, 90.14% recall, and 90.77% F1-score using micro-averaging. On the other hand, our indexing system achieves 92.91% precision, 92.44% recall, and 92.67% F1-score. These results would place us first in the final ranking.
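The dictionary-plus-Levenshtein idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' 6-step architecture, whose details are not given in this record. The synonym dictionary, the placeholder codes (not real SNOMED-CT identifiers), the function names, and the max_distance threshold are all assumptions made for illustration.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical synonym dictionary; the codes are placeholders,
# NOT real SNOMED-CT identifiers.
SYNONYMS = {
    "paracetamol": "CODE-001",
    "acetaminofen": "CODE-001",
    "ibuprofeno": "CODE-002",
}


def assign_code(mention: str, max_distance: int = 2) -> Optional[str]:
    """Return the code of the closest dictionary term, or None if nothing is close."""
    term = mention.strip().lower()
    if term in SYNONYMS:  # exact match first
        return SYNONYMS[term]
    best = min(SYNONYMS, key=lambda s: levenshtein(term, s))
    if levenshtein(term, best) <= max_distance:
        return SYNONYMS[best]
    return None
```

In this sketch an exact dictionary hit is tried first and the fuzzy match is only a fallback, so a mention with a small spelling variation still resolves to a code, while a mention far from every dictionary term is left unassigned.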
format Online
Article
Text
id pubmed-8684055
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8684055 2021-12-20 Combining word embeddings to extract chemical and drug entities in biomedical literature López-Úbeda, Pilar Díaz-Galiano, Manuel Carlos Ureña-López, L. Alfonso Martín-Valdivia, M. Teresa BMC Bioinformatics Research
BioMed Central 2021-12-17 /pmc/articles/PMC8684055/ /pubmed/34920708 http://dx.doi.org/10.1186/s12859-021-04188-3 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Combining word embeddings to extract chemical and drug entities in biomedical literature
topic Research