
Comparing general and specialized word embeddings for biomedical named entity recognition

Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source...


Bibliographic Details
Main Authors:	Ramos-Vargas, Rigo E.; Román-Godínez, Israel; Torres-Ramos, Sulema
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2021
Subjects:	Bioinformatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959609/ https://www.ncbi.nlm.nih.gov/pubmed/33817030 http://dx.doi.org/10.7717/peerj-cs.384

author	Ramos-Vargas, Rigo E.; Román-Godínez, Israel; Torres-Ramos, Sulema
collection	PubMed
description	Increased interest in the use of word embeddings, such as word representation, for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to be used. One common criterion for selecting a word embedding is the type of source from which it is generated; that is, general (e.g., Wikipedia, Common Crawl), or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, considering that they have provided better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. The latter, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) with respect to two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (of the general type) and Pyysalo PM + PMC (specific), as the sole features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. The first set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of the size of the corpus on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with respect to several gold standards, and the fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better on the DrugBank corpus, despite having less word coverage and a lower internal semantic relationship than the classic specific word embedding, Pyysalo PM + PMC; among the contextualized word embeddings, by contrast, the specific ones gave the best results. We conclude, therefore, that when using classic word embeddings as features on the BioNER task, the general ones could be considered a good option. On the other hand, when using contextualized word embeddings, the specific ones are the best option.
format	Online Article Text
id	pubmed-7959609
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-7959609 2021-04-02 PeerJ Comput Sci PeerJ Inc. 2021-02-18 /pmc/articles/PMC7959609/ /pubmed/33817030 http://dx.doi.org/10.7717/peerj-cs.384 Text en © 2021 Ramos-Vargas et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title	Comparing general and specialized word embeddings for biomedical named entity recognition
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7959609/ https://www.ncbi.nlm.nih.gov/pubmed/33817030 http://dx.doi.org/10.7717/peerj-cs.384
