Medical terminology-based computing system: a lightweight post-processing solution for out-of-vocabulary multi-word terms

The linguistic rules of medical terminology help readers become familiar with rare or complex clinical and biomedical terms. Medical language follows a Greek- and Latin-inspired nomenclature, which helps stakeholders simplify medical terms and gain semantic familiarity with them....

Bibliographic Details
Main authors: Saeed, Nadia; Naveed, Hammad
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Molecular Biosciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9411640/
https://www.ncbi.nlm.nih.gov/pubmed/36032678
http://dx.doi.org/10.3389/fmolb.2022.928530
_version_ 1784775313600806912
author Saeed, Nadia
Naveed, Hammad
collection PubMed
description The linguistic rules of medical terminology help readers become familiar with rare or complex clinical and biomedical terms. Medical language follows a Greek- and Latin-inspired nomenclature, which helps stakeholders simplify medical terms and gain semantic familiarity with them. However, natural language processing models misrepresent rare and complex biomedical words. In this study, we present MedTCS, a lightweight post-processing module that simplifies hybridized or compound terms into regular words using medical nomenclature. MedTCS enabled word-based embedding models to achieve 100% coverage and enabled the BioWordVec model to achieve high correlation scores (0.641 and 0.603 on the UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model, FastText-OA-All-300d, to improve its F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus. Similarly, in the drug indication classification task, our model increased coverage by 9% and the F1-score by 1%. Our results indicate that incorporating a medical terminology-based module as a post-processing step on pre-trained embeddings provides distinctive contextual clues that enhance the vocabulary. We demonstrate that the proposed module enables word embedding models to generate vectors for out-of-vocabulary words effectively. We expect that our study can be a stepping stone for the use of biomedical knowledge-driven resources in NLP.
format Online
Article
Text
id pubmed-9411640
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9411640 2022-08-27 Medical terminology-based computing system: a lightweight post-processing solution for out-of-vocabulary multi-word terms Saeed, Nadia Naveed, Hammad Front Mol Biosci Molecular Biosciences Frontiers Media S.A. 2022-08-12 /pmc/articles/PMC9411640/ /pubmed/36032678 http://dx.doi.org/10.3389/fmolb.2022.928530 Text en Copyright © 2022 Saeed and Naveed. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Medical terminology-based computing system: a lightweight post-processing solution for out-of-vocabulary multi-word terms
topic Molecular Biosciences