Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Bibliographic Details
Main authors:	Albalawi, Yahya; Buckley, Jim; Nikolov, Nikola S.
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2021
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8253467/ https://www.ncbi.nlm.nih.gov/pubmed/34249602 http://dx.doi.org/10.1186/s40537-021-00488-w

author	Albalawi, Yahya; Buckley, Jim; Nikolov, Nikola S.
collection	PubMed
description	This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques for Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets when training a classifier to identify health-related tweets. For this task we use the traditional machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. We also report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another used only for testing the generality of our models. Our results indicate that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to that of the deep learning methods on the first data set, but significantly worse on the second data set.
format	Online Article Text
id	pubmed-8253467
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-8253467 2021-07-06 J Big Data Research Springer International Publishing 2021-07-02 2021 /pmc/articles/PMC8253467/ /pubmed/34249602 http://dx.doi.org/10.1186/s40537-021-00488-w Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8253467/ https://www.ncbi.nlm.nih.gov/pubmed/34249602 http://dx.doi.org/10.1186/s40537-021-00488-w
