Thesaurus-based word embeddings for automated biomedical literature classification
The special nature, volume and broadness of biomedical literature pose barriers for automated classification methods. On the other hand, manual indexing is time-consuming, costly and error-prone. We argue that current word embedding algorithms can be efficiently used to support the task of biomedi...
Main authors: | Koutsomitropoulos, Dimitrios A., Andriopoulos, Andreas D. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer London 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8111057/ https://www.ncbi.nlm.nih.gov/pubmed/33994670 http://dx.doi.org/10.1007/s00521-021-06053-z |
_version_ | 1783690422558654464 |
---|---|
author | Koutsomitropoulos, Dimitrios A. Andriopoulos, Andreas D. |
author_facet | Koutsomitropoulos, Dimitrios A. Andriopoulos, Andreas D. |
author_sort | Koutsomitropoulos, Dimitrios A. |
collection | PubMed |
description | The special nature, volume and broadness of biomedical literature pose barriers for automated classification methods. On the other hand, manual indexing is time-consuming, costly and error-prone. We argue that current word embedding algorithms can be efficiently used to support the task of biomedical text classification even in a multilabel setting, with many distinct labels. The ontology representation of Medical Subject Headings provides machine-readable labels and specifies the dimensionality of the problem space. Both deep and shallow network approaches are implemented. Predictions are determined by the similarity between extracted features from contextualized representations of abstracts and headings. The addition of a separate classifier for transfer learning is also proposed and evaluated. Large datasets of biomedical citations are harvested for their metadata and used for training and testing. These automated approaches are still far from entirely substituting human experts, yet they can be useful as a mechanism for validation and recommendation. Dataset balancing, distributed processing and training parallelization on GPUs all play an important part in the effectiveness and performance of the proposed methods. |
format | Online Article Text |
id | pubmed-8111057 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Springer London |
record_format | MEDLINE/PubMed |
spelling | pubmed-81110572021-05-11 Thesaurus-based word embeddings for automated biomedical literature classification Koutsomitropoulos, Dimitrios A. Andriopoulos, Andreas D. Neural Comput Appl Special issue on Advances of Neural Computing phasing challenges in the era of 4th industrial revolution The special nature, volume and broadness of biomedical literature pose barriers for automated classification methods. On the other hand, manually indexing is time-consuming, costly and error prone. We argue that current word embedding algorithms can be efficiently used to support the task of biomedical text classification even in a multilabel setting, with many distinct labels. The ontology representation of Medical Subject Headings provides machine-readable labels and specifies the dimensionality of the problem space. Both deep- and shallow network approaches are implemented. Predictions are determined by the similarity between extracted features from contextualized representations of abstracts and headings. The addition of a separate classifier for transfer learning is also proposed and evaluated. Large datasets of biomedical citations are harvested for their metadata and used for training and testing. These automated approaches are still far from entirely substituting human experts, yet they can be useful as a mechanism for validation and recommendation. Dataset balancing, distributed processing and training parallelization in GPUs, all play an important part regarding the effectiveness and performance of proposed methods. Springer London 2021-05-11 2022 /pmc/articles/PMC8111057/ /pubmed/33994670 http://dx.doi.org/10.1007/s00521-021-06053-z Text en © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. 
These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Special issue on Advances of Neural Computing phasing challenges in the era of 4th industrial revolution Koutsomitropoulos, Dimitrios A. Andriopoulos, Andreas D. Thesaurus-based word embeddings for automated biomedical literature classification |
title | Thesaurus-based word embeddings for automated biomedical literature classification |
title_full | Thesaurus-based word embeddings for automated biomedical literature classification |
title_fullStr | Thesaurus-based word embeddings for automated biomedical literature classification |
title_full_unstemmed | Thesaurus-based word embeddings for automated biomedical literature classification |
title_short | Thesaurus-based word embeddings for automated biomedical literature classification |
title_sort | thesaurus-based word embeddings for automated biomedical literature classification |
topic | Special issue on Advances of Neural Computing phasing challenges in the era of 4th industrial revolution |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8111057/ https://www.ncbi.nlm.nih.gov/pubmed/33994670 http://dx.doi.org/10.1007/s00521-021-06053-z |
work_keys_str_mv | AT koutsomitropoulosdimitriosa thesaurusbasedwordembeddingsforautomatedbiomedicalliteratureclassification AT andriopoulosandreasd thesaurusbasedwordembeddingsforautomatedbiomedicalliteratureclassification |