Detection of offensive terms in resource-poor language using machine learning algorithms

The use of offensive terms in user-generated content on different social media platforms is one of the major concerns for these platforms. The offensive terms have a negative impact on individuals, which may lead towards the degradation of societal and civilized manners. The immense amount of conten...

Bibliographic Details
Main authors:	Raza, Muhammad Owais; Mahoto, Naeem Ahmed; Hamdi, Mohammed; Reshan, Mana Saleh Al; Rajab, Adel; Shaikh, Asadullah
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Algorithms and Analysis of Algorithms
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10496005/ https://www.ncbi.nlm.nih.gov/pubmed/37705647 http://dx.doi.org/10.7717/peerj-cs.1524

author	Raza, Muhammad Owais; Mahoto, Naeem Ahmed; Hamdi, Mohammed; Reshan, Mana Saleh Al; Rajab, Adel; Shaikh, Asadullah
collection	PubMed
description	The use of offensive terms in user-generated content is a major concern for social media platforms. Offensive terms have a negative impact on individuals and may contribute to the degradation of societal and civilized manners. The immense amount of content generated at high speed makes it humanly impossible to categorise and detect offensive terms, and detecting them automatically remains an open challenge for natural language processing (NLP). Substantial efforts have been made for high-resource languages such as English. The task is more challenging for resource-poor languages such as Urdu because of the lack of standard datasets and pre-processing tools for automatic offensive term detection. This paper introduces a combinatorial pre-processing approach for developing a classification model for cross-platform (Twitter and YouTube) use. The approach uses datasets from the two platforms for training and testing the model, which is built with decision tree, random forest and naive Bayes algorithms. The combinatorial pre-processing approach is applied to check how machine learning models behave with different combinations of standard pre-processing techniques for a low-resource language in the cross-platform setting. The experimental results show the effectiveness of the machine learning models over different subsets of traditional pre-processing approaches in building a classification model for automatic offensive term detection for a low-resource language, i.e., Urdu, in the cross-platform scenario. In the experiments, when dataset D1 is used for training and D2 for testing, stopword removal gave the best results, with an accuracy of 83.27%. Conversely, when dataset D2 is used for training and D1 for testing, the combination of stopword removal and punctuation removal performed best, with an accuracy of 74.54%. The combinatorial approach proposed in this paper outperformed the benchmark for the considered datasets using classical as well as ensemble machine learning, with accuracies of 82.9% and 97.2% for datasets D1 and D2, respectively.
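The combinatorial idea in the abstract above (try every subset of pre-processing steps, train on one platform's data, test on the other's) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the step functions, the placeholder English stopword list, the `digit_removal` step, and the small add-one-smoothed naive Bayes classifier are all hypothetical stand-ins. The record does not list the full set of pre-processing techniques or the Urdu resources used in the paper.

```python
from collections import Counter
from itertools import combinations
import math
import string

# Placeholder stopword list (assumption); the paper works on Urdu text.
STOPWORDS = {"the", "a", "is"}

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_digits(text):
    return "".join(ch for ch in text if not ch.isdigit())

# Step names are illustrative; only stopword and punctuation removal
# are named in the abstract.
STEPS = {
    "stopword_removal": remove_stopwords,
    "punctuation_removal": remove_punctuation,
    "digit_removal": remove_digits,
}

def step_subsets(names):
    """Yield every subset of step names, including the empty one."""
    names = sorted(names)
    for r in range(len(names) + 1):
        yield from combinations(names, r)

def preprocess(text, subset):
    """Apply the chosen steps in the (sorted) order given by the subset."""
    for name in subset:
        text = STEPS[name](text)
    return text

def train_nb(docs, labels):
    """Collect the counts needed for multinomial naive Bayes."""
    class_counts = Counter(labels)
    word_counts, vocab = {}, set()
    for doc, y in zip(docs, labels):
        tokens = doc.split()
        word_counts.setdefault(y, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Return the most probable class, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, n in class_counts.items():
        counts = word_counts[y]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n / total)
        for tok in doc.split():
            if tok in vocab:  # tokens unseen in training are ignored
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best, best_score = y, score
    return best

def cross_platform_accuracy(train, test, subset):
    """Train on one dataset, test on the other, with one step subset."""
    docs = [preprocess(t, subset) for t, _ in train]
    model = train_nb(docs, [y for _, y in train])
    hits = sum(predict_nb(model, preprocess(t, subset)) == y for t, y in test)
    return hits / len(test)

def evaluate_all(train, test):
    """Map each subset of pre-processing steps to its test accuracy."""
    return {s: cross_platform_accuracy(train, test, s) for s in step_subsets(STEPS)}
```

Calling `evaluate_all(d1, d2)` and then `evaluate_all(d2, d1)`, where `d1` and `d2` are lists of `(text, label)` pairs, would mirror the two train/test directions reported in the abstract. With three steps the sketch evaluates eight subsets per direction, and the count doubles with each additional step.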
format	Online Article Text
id	pubmed-10496005
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-10496005 2023-09-13 PeerJ Comput Sci PeerJ Inc. 2023-08-29 /pmc/articles/PMC10496005/ /pubmed/37705647 http://dx.doi.org/10.7717/peerj-cs.1524 Text en ©2023 Raza et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title	Detection of offensive terms in resource-poor language using machine learning algorithms
topic	Algorithms and Analysis of Algorithms
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10496005/ https://www.ncbi.nlm.nih.gov/pubmed/37705647 http://dx.doi.org/10.7717/peerj-cs.1524
