Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions

This article focuses on the problem of detecting toxicity in online discussions. Toxicity is currently a serious problem when people are largely influenced by opinions on social networks. We offer a solution based on classification models using machine learning methods to classify short texts on soc...

Bibliographic Details
Main Authors: Machová, Kristína, Mach, Marián, Adamišín, Kamil
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9459955/
https://www.ncbi.nlm.nih.gov/pubmed/36080927
http://dx.doi.org/10.3390/s22176468
_version_ 1784786631990968320
author Machová, Kristína
Mach, Marián
Adamišín, Kamil
author_facet Machová, Kristína
Mach, Marián
Adamišín, Kamil
author_sort Machová, Kristína
collection PubMed
description This article focuses on the problem of detecting toxicity in online discussions. Toxicity is currently a serious problem, as people are largely influenced by opinions on social networks. We offer a solution based on classification models using machine learning methods to classify short texts on social networks into multiple degrees of toxicity. The classification models used both classic methods of machine learning, such as naïve Bayes and SVM (support vector machine), as well as ensemble methods, such as bagging and RF (random forest). The models were created using text data, which we extracted from social networks in the Slovak language. The labelling of our dataset of short texts into multiple classes (the degrees of toxicity) was provided automatically by our method based on the lexicon approach to text processing. This lexicon method required creating a dictionary of toxic words in the Slovak language, which is another contribution of the work. Finally, an application was created based on the learned machine learning models, which can be used to detect the degree of toxicity of new social network comments as well as for experimentation with various machine learning methods. We achieved the best results using an SVM, with an average accuracy of 0.89 and an F1 of 0.79. This model also outperformed the ensemble learning by the RF and bagging methods; however, the ensemble learning methods achieved better results than the naïve Bayes method.
format Online
Article
Text
id pubmed-9459955
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9459955 2022-09-10 Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions Machová, Kristína Mach, Marián Adamišín, Kamil Sensors (Basel) Article This article focuses on the problem of detecting toxicity in online discussions. Toxicity is currently a serious problem, as people are largely influenced by opinions on social networks. We offer a solution based on classification models using machine learning methods to classify short texts on social networks into multiple degrees of toxicity. The classification models used both classic methods of machine learning, such as naïve Bayes and SVM (support vector machine), as well as ensemble methods, such as bagging and RF (random forest). The models were created using text data, which we extracted from social networks in the Slovak language. The labelling of our dataset of short texts into multiple classes (the degrees of toxicity) was provided automatically by our method based on the lexicon approach to text processing. This lexicon method required creating a dictionary of toxic words in the Slovak language, which is another contribution of the work. Finally, an application was created based on the learned machine learning models, which can be used to detect the degree of toxicity of new social network comments as well as for experimentation with various machine learning methods. We achieved the best results using an SVM, with an average accuracy of 0.89 and an F1 of 0.79. This model also outperformed the ensemble learning by the RF and bagging methods; however, the ensemble learning methods achieved better results than the naïve Bayes method. MDPI 2022-08-27 /pmc/articles/PMC9459955/ /pubmed/36080927 http://dx.doi.org/10.3390/s22176468 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Machová, Kristína
Mach, Marián
Adamišín, Kamil
Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions
title Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions
title_full Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions
title_fullStr Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions
title_full_unstemmed Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions
title_short Machine Learning and Lexicon Approach to Texts Processing in the Detection of Degrees of Toxicity in Online Discussions
title_sort machine learning and lexicon approach to texts processing in the detection of degrees of toxicity in online discussions
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9459955/
https://www.ncbi.nlm.nih.gov/pubmed/36080927
http://dx.doi.org/10.3390/s22176468
work_keys_str_mv AT machovakristina machinelearningandlexiconapproachtotextsprocessinginthedetectionofdegreesoftoxicityinonlinediscussions
AT machmarian machinelearningandlexiconapproachtotextsprocessinginthedetectionofdegreesoftoxicityinonlinediscussions
AT adamisinkamil machinelearningandlexiconapproachtotextsprocessinginthedetectionofdegreesoftoxicityinonlinediscussions