Research on performance variations of classifiers with the influence of pre-processing methods for Chinese short text classification

Text pre-processing is an important component of Chinese text classification. At present, however, most studies on this topic explore the influence of preprocessing methods on a few text classification algorithms using English text. In this paper we experimentally compared fiftee...

Bibliographic Details
Main authors:	Zhang, Dezheng; Li, Jing; Xie, Yonghong; Wulamu, Aziguli
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10569603/ https://www.ncbi.nlm.nih.gov/pubmed/37824464 http://dx.doi.org/10.1371/journal.pone.0292582

collection	PubMed
description	Text pre-processing is an important component of Chinese text classification. At present, however, most studies on this topic explore the influence of preprocessing methods on a few text classification algorithms using English text. In this paper we experimentally compared fifteen commonly used classifiers on two Chinese datasets using three widely used Chinese preprocessing methods: word segmentation, Chinese-specific stop word removal, and Chinese-specific symbol removal. We then explored the influence of the preprocessing methods on the final classifications under various conditions such as classification evaluation, combination style, and classifier selection. Finally, we conducted a battery of additional experiments and found that most of the classifiers improved in performance after proper preprocessing was applied. Our general conclusion is that the systematic use of preprocessing methods can have a positive impact on the classification of Chinese short text, given a classification evaluation such as macro-F1, a combination of preprocessing methods such as word segmentation with Chinese-specific stop word and symbol removal, and a classifier selection such as machine learning and deep learning models. We find that the best macro-F1 scores for the two datasets are 92.13% and 91.99%, which represent improvements of 0.3% and 2%, respectively, over the compared baselines.
id	pubmed-10569603
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-10569603 2023-10-13 PLoS One Research Article Public Library of Science 2023-10-12 /pmc/articles/PMC10569603/ /pubmed/37824464 http://dx.doi.org/10.1371/journal.pone.0292582 Text en © 2023 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
