A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts

Short message services (SMS), microblogging tools, instant messaging apps, and commercial websites produce numerous short text messages every day. These short text messages are usually guaranteed to reach a mass audience at low cost. Spammers take advantage of short texts by sending bulk malicious or...

Bibliographic Details
Main Authors: Xia, Tian; Chen, Xuemin; Wang, Jiacun; Qiu, Feng
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10649562/
https://www.ncbi.nlm.nih.gov/pubmed/37960672
http://dx.doi.org/10.3390/s23218975
_version_ 1785135580773875712
author Xia, Tian
Chen, Xuemin
Wang, Jiacun
Qiu, Feng
author_facet Xia, Tian
Chen, Xuemin
Wang, Jiacun
Qiu, Feng
author_sort Xia, Tian
collection PubMed
description Short message services (SMS), microblogging tools, instant messaging apps, and commercial websites produce numerous short text messages every day. These short text messages are usually guaranteed to reach a mass audience at low cost. Spammers take advantage of short texts by sending bulk malicious or unwanted messages. Short texts are difficult to classify because of their shortness, sparsity, rapidness, and informal writing. The effectiveness of the hidden Markov model (HMM) for short text classification has been illustrated in our previous study. However, the HMM has limited capability to handle new words, which are mostly generated by informal writing. In this paper, a hybrid model is proposed to address the informal writing issue by weighting new words for fast short text filtering with high accuracy. The hybrid model consists of an artificial neural network (ANN) and an HMM, which are used for new word weighting and spam filtering, respectively. The weight of a new word is calculated based on the weights of its neighbor, along with the spam and ham (i.e., not spam) probabilities of the short text message predicted by the ANN. Performance evaluations are conducted on benchmark datasets, including the SMS message data maintained by the University of California, Irvine; movie reviews; and customer reviews. The hybrid model operates at a significantly higher speed than deep learning models. The experimental results show that the proposed hybrid model outperforms other prominent machine learning algorithms, achieving a good balance between filtering throughput and accuracy.
format Online
Article
Text
id pubmed-10649562
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10649562 2023-11-04 A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts Xia, Tian Chen, Xuemin Wang, Jiacun Qiu, Feng Sensors (Basel) Article Short message services (SMS), microblogging tools, instant messaging apps, and commercial websites produce numerous short text messages every day. These short text messages are usually guaranteed to reach a mass audience at low cost. Spammers take advantage of short texts by sending bulk malicious or unwanted messages. Short texts are difficult to classify because of their shortness, sparsity, rapidness, and informal writing. The effectiveness of the hidden Markov model (HMM) for short text classification has been illustrated in our previous study. However, the HMM has limited capability to handle new words, which are mostly generated by informal writing. In this paper, a hybrid model is proposed to address the informal writing issue by weighting new words for fast short text filtering with high accuracy. The hybrid model consists of an artificial neural network (ANN) and an HMM, which are used for new word weighting and spam filtering, respectively. The weight of a new word is calculated based on the weights of its neighbor, along with the spam and ham (i.e., not spam) probabilities of the short text message predicted by the ANN. Performance evaluations are conducted on benchmark datasets, including the SMS message data maintained by the University of California, Irvine; movie reviews; and customer reviews. The hybrid model operates at a significantly higher speed than deep learning models. The experimental results show that the proposed hybrid model outperforms other prominent machine learning algorithms, achieving a good balance between filtering throughput and accuracy. MDPI 2023-11-04 /pmc/articles/PMC10649562/ /pubmed/37960672 http://dx.doi.org/10.3390/s23218975 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Xia, Tian
Chen, Xuemin
Wang, Jiacun
Qiu, Feng
A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts
title A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts
title_full A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts
title_fullStr A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts
title_full_unstemmed A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts
title_short A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts
title_sort hybrid model with new word weighting for fast filtering spam short texts
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10649562/
https://www.ncbi.nlm.nih.gov/pubmed/37960672
http://dx.doi.org/10.3390/s23218975
work_keys_str_mv AT xiatian ahybridmodelwithnewwordweightingforfastfilteringspamshorttexts
AT chenxuemin ahybridmodelwithnewwordweightingforfastfilteringspamshorttexts
AT wangjiacun ahybridmodelwithnewwordweightingforfastfilteringspamshorttexts
AT qiufeng ahybridmodelwithnewwordweightingforfastfilteringspamshorttexts
AT xiatian hybridmodelwithnewwordweightingforfastfilteringspamshorttexts
AT chenxuemin hybridmodelwithnewwordweightingforfastfilteringspamshorttexts
AT wangjiacun hybridmodelwithnewwordweightingforfastfilteringspamshorttexts
AT qiufeng hybridmodelwithnewwordweightingforfastfilteringspamshorttexts