
Comparative analysis of classification techniques for topic-based biomedical literature categorisation

Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly q...


Bibliographic Details
Main authors: Stepanov, Ihor; Ivasiuk, Arsentii; Yavorskyi, Oleksandr; Frolova, Alina
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Genetics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10668010/
https://www.ncbi.nlm.nih.gov/pubmed/38028616
http://dx.doi.org/10.3389/fgene.2023.1238140
author Stepanov, Ihor
Ivasiuk, Arsentii
Yavorskyi, Oleksandr
Frolova, Alina
author_sort Stepanov, Ihor
collection PubMed
description Introduction: Scientific articles serve as vital sources of biomedical information, but with the yearly growth in publication volume, processing such vast amounts of information has become increasingly challenging. This difficulty is particularly pronounced when it requires the expertise of highly qualified professionals. Our research focused on classifying domain-specific articles to determine whether they contain information about drug-induced liver injury (DILI). DILI is a clinically significant condition and one of the reasons for drug registration failures. The rapid and accurate identification of drugs that may cause such conditions can prevent side effects in millions of patients. Methods: A text classification method can help regulators, such as the FDA, identify evidence of potential DILI for specific drugs much faster and at massive scale. In our study, we compared several text classification methodologies, including transformers, LSTMs, and information-theoretic and statistics-based methods. We devised a simple and interpretable text classification method that is as fast as Naïve Bayes while delivering superior performance for topic-oriented text categorisation. Moreover, we revisited techniques and methodologies for handling imbalanced data. Results: Transformers achieve the best results when the class distribution and semantics of the test data match the training set. With imbalanced data, however, simple models based on statistics and information theory can surpass complex transformers while yielding more interpretable results, which are important in the biomedical domain. Our results also show that neural networks achieve better results when they are pre-trained on domain-specific data and the loss function is designed to reflect the class distribution.
Discussion: Overall, transformers are a powerful architecture; however, in certain cases, such as topic classification, their use can be redundant, and simple statistical approaches can achieve comparable results while being much faster and more explainable. However, we see potential in combining results from both worlds. The development of new neural network architectures, loss functions and training procedures that bring stability on unbalanced data is a promising direction.
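The abstract mentions a loss function designed to reflect the class distribution but does not say which one was used. As an illustration only, the sketch below shows one common option, cross-entropy with inverse-frequency class weights. The function names, the toy labels and the predicted probabilities are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in counts}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs[i] maps each class to its predicted probability for sample i.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical toy data: 1 = DILI-related, 0 = unrelated (imbalanced 1:4).
labels = [0, 0, 0, 0, 1]
weights = inverse_frequency_weights(labels)

# Assumed model outputs: confident on the majority class, wrong on the rare one.
probs = [{0: 0.9, 1: 0.1}] * 4 + [{0: 0.6, 1: 0.4}]

unweighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights, unweighted, weighted)
```

With these assumed numbers the rare class receives the larger weight, so the misclassified minority example contributes more to the weighted loss than to the unweighted one.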
format Online
Article
Text
id pubmed-10668010
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10668010 2023-11-07 Comparative analysis of classification techniques for topic-based biomedical literature categorisation Stepanov, Ihor Ivasiuk, Arsentii Yavorskyi, Oleksandr Frolova, Alina Front Genet Genetics Frontiers Media S.A. 2023-11-07 /pmc/articles/PMC10668010/ /pubmed/38028616 http://dx.doi.org/10.3389/fgene.2023.1238140 Text en Copyright © 2023 Stepanov, Ivasiuk, Yavorskyi and Frolova. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Comparative analysis of classification techniques for topic-based biomedical literature categorisation
topic Genetics