
TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information

With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content with valuable information that can be useful in documen...


Bibliographic Details
Main Authors: Voskergian, Daniel, Bakir-Gungor, Burcu, Yousef, Malik
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10585361/
https://www.ncbi.nlm.nih.gov/pubmed/37867598
http://dx.doi.org/10.3389/fgene.2023.1243874
_version_ 1785122939093385216
author Voskergian, Daniel
Bakir-Gungor, Burcu
Yousef, Malik
collection PubMed
description With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles’ content and carry valuable information for document classification and categorization. However, the shortness, data sparseness, limited word occurrences, and inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms to these short texts, making their classification a challenging task. This study first explores the performance of our earlier approach, TextNetTopics, on short text. Second, we propose an advanced version called TextNetTopics Pro, a novel short-text classification framework that combines lexical features organized in topics of words with the topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate the proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is from the biomedical field, and the second covers computer science publications. Additionally, we compare the predictive performance of the models generated with and without the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.
format Online
Article
Text
id pubmed-10585361
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10585361 2023-10-20 TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information Voskergian, Daniel Bakir-Gungor, Burcu Yousef, Malik Front Genet Genetics Frontiers Media S.A. 2023-10-05 /pmc/articles/PMC10585361/ /pubmed/37867598 http://dx.doi.org/10.3389/fgene.2023.1243874 Text en Copyright © 2023 Voskergian, Bakir-Gungor and Yousef. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10585361/
https://www.ncbi.nlm.nih.gov/pubmed/37867598
http://dx.doi.org/10.3389/fgene.2023.1243874