
A Combined Approach for Multi-Label Text Data Classification

Automated data analysis solutions depend heavily on data and its quality. The possibility of assigning more than one class to the same data item is one of the specifics that needs to be taken into account. There are no solutions dedicated to Lithuanian text data classification that help to...


Bibliographic Details
Main Authors: Štrimaitis, Rokas; Stefanovič, Pavel; Ramanauskaitė, Simona; Slotkienė, Asta
Format: Online Article Text
Language: English
Published: Hindawi 2022
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9242766/
https://www.ncbi.nlm.nih.gov/pubmed/35785066
http://dx.doi.org/10.1155/2022/3369703
author Štrimaitis, Rokas
Stefanovič, Pavel
Ramanauskaitė, Simona
Slotkienė, Asta
author_sort Štrimaitis, Rokas
collection PubMed
description Automated data analysis solutions depend heavily on data and its quality. The possibility of assigning more than one class to the same data item is one of the specifics that needs to be taken into account. There are no solutions dedicated to Lithuanian text data classification that help to assign more than one class to a data item. In this paper, a new combined approach is proposed for multilabel text data classification. The main aim of the proposed approach is to improve the accuracy of traditional classification algorithms by incorporating the results obtained using similarity measures. The experimental investigation was performed using multilabel financial news text data in the Lithuanian language. The data were collected from four public websites and manually classified by experts into ten classes, with each data item assigned no more than two classes. The results of five commonly used classification algorithms were compared: the support vector machine, multinomial naive Bayes, k-nearest neighbours, decision trees, and linear discriminant analysis. In addition, two similarity measures were compared: cosine similarity and the Dice coefficient. The best results were obtained using cosine similarity and the multinomial naive Bayes classifier, and the proposed approach combines the results of these two methods. Experiments with different cases of the proposed approach revealed the peculiarities of its application. At the same time, the combined approach yielded a statistically significant increase in global accuracy.
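The two similarity measures named in the description, and the general idea of merging similarity results with classifier output, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record does not state how the two sets of results are combined, so the `combine_scores` function, its weighted average, the `weight` and `threshold` parameters, and the example class names are all assumptions made for illustration. Only the limit of two classes per item comes from the description.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dice_coefficient(tokens_a, tokens_b):
    """Dice coefficient on the sets of distinct tokens: 2|A∩B| / (|A| + |B|)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def combine_scores(classifier_probs, similarity_scores,
                   weight=0.5, max_labels=2, threshold=0.3):
    """Hypothetical combination rule (an assumption, not from the paper):
    weighted average of per-class scores, keeping at most `max_labels`
    classes whose combined score reaches `threshold`."""
    classes = set(classifier_probs) | set(similarity_scores)
    combined = {
        c: weight * classifier_probs.get(c, 0.0)
           + (1 - weight) * similarity_scores.get(c, 0.0)
        for c in classes
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return [c for c, score in ranked[:max_labels] if score >= threshold]


if __name__ == "__main__":
    # Illustrative inputs only; class names and scores are made up.
    probs = {"banks": 0.6, "markets": 0.3, "energy": 0.1}
    sims = {"banks": 0.5, "markets": 0.5, "energy": 0.0}
    print(combine_scores(probs, sims))
```

In a real pipeline the classifier probabilities would come from a trained model such as multinomial naive Bayes, and the similarity scores from comparing a new text against labelled reference texts; both inputs here are plain dictionaries so that the sketch stays self-contained.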
format Online
Article
Text
id pubmed-9242766
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-9242766 2022-06-30 A Combined Approach for Multi-Label Text Data Classification. Štrimaitis, Rokas; Stefanovič, Pavel; Ramanauskaitė, Simona; Slotkienė, Asta. Comput Intell Neurosci. Research Article.
Hindawi 2022-06-22 /pmc/articles/PMC9242766/ /pubmed/35785066 http://dx.doi.org/10.1155/2022/3369703 Text en Copyright © 2022 Rokas Štrimaitis et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A Combined Approach for Multi-Label Text Data Classification
topic Research Article