Identifying vulgarity in Bengali social media textual content

The presence of abusive and vulgar language in social media has become an issue of increasing concern in recent years. However, research pertaining to the prevalence and identification of vulgar language has remained largely unexplored in low-resource languages such as Bengali. In this paper, we pro...

Bibliographic Details
Main author:	Sazzed, Salim
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2021
Subjects:	Computational Linguistics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8576541/ https://www.ncbi.nlm.nih.gov/pubmed/34805498 http://dx.doi.org/10.7717/peerj-cs.665

author	Sazzed, Salim
collection	PubMed
description	The presence of abusive and vulgar language in social media has become an issue of increasing concern in recent years. However, the prevalence and identification of vulgar language have remained largely unexplored in low-resource languages such as Bengali. In this paper, we provide the first comprehensive analysis of the presence of vulgarity in Bengali social media content. We develop two benchmark corpora consisting of 7,245 reviews collected from YouTube and manually annotate them into vulgar and non-vulgar categories. The manual annotation reveals the ubiquity of vulgar and swear words in Bengali social media content, with the share of vulgar content in the two corpora ranging from 20% to 34%. To automatically identify vulgarity, we employ various approaches, such as classical machine learning (CML) classifiers, a Stochastic Gradient Descent (SGD) optimizer, a deep learning (DL) based architecture, and lexicon-based methods. We find that the swear/vulgar lexicon, although small in size, is effective at identifying vulgar language owing to the high presence of some swear terms in Bengali social media. We observe that the performance of the machine learning (ML) classifiers is affected by the class distribution of the dataset. The DL-based BiLSTM (Bidirectional Long Short-Term Memory) model yields the highest recall scores for identifying vulgarity in both datasets (i.e., in both original and class-balanced settings). In addition, the analysis reveals that vulgarity is highly correlated with negative sentiment in social media comments.
format	Online Article Text
id	pubmed-8576541
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-8576541 2021-11-19 Identifying vulgarity in Bengali social media textual content Sazzed, Salim PeerJ Comput Sci Computational Linguistics PeerJ Inc. 2021-10-19 /pmc/articles/PMC8576541/ /pubmed/34805498 http://dx.doi.org/10.7717/peerj-cs.665 Text en © 2021 Sazzed https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title	Identifying vulgarity in Bengali social media textual content
topic	Computational Linguistics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8576541/ https://www.ncbi.nlm.nih.gov/pubmed/34805498 http://dx.doi.org/10.7717/peerj-cs.665
