ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment

Bibliographic Details
Main Author:	Rashid, Mohammad Mamun Or
Format:	Online Article Text
Language:	English
Published:	Elsevier 2022
Subjects:	Data Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9256543/ https://www.ncbi.nlm.nih.gov/pubmed/35811647 http://dx.doi.org/10.1016/j.dib.2022.108416

_version_	1784741137044471808
author	Rashid, Mohammad Mamun Or
collection	PubMed
description	Toxic language on social media is a newly emerging virtual disorder in human society. Detecting toxic language is an NLP task that requires a dataset of utterances [1]. For the Bangla language, very few datasets have been developed on toxicity or similar concepts [2]. This dataset was built from user-generated content on Facebook and is intended to cover the demographic and thematic distribution of Bangla toxic language generated on the web. A total of 2,207,590 comments were collected and annotated, and about 1959 unique bigrams were extracted from them as utterances; these bigrams serve as the base entries of the toxic language dataset. The core derivatives of the dataset are bigram-based wordlists, which are annotated inductively and divided into 8 thematic classes that give some idea of the toxicity variations found in the Bengali community. These thematic classes are dominated by political hate speech [3] and misogynistic bullying. The thematic labels are intended to serve as class labels in machine-learning text classification. In addition to the thematic labels, the dataset includes further features: approximate meanings in English, IPA transliteration, real occurrences in the source pages, spelling standards, and degree of toxicity. Because it is a dataset of utterances, its entries are de-identified and anonymous, and it poses no difficulties for public disclosure. We therefore regard this toxic lexicon (ToxLex) as an exhaustive wordlist: a curated, value-added, and analyzed dataset that can be used as classifier material to detect toxicity in social media.
format	Online Article Text
id	pubmed-9256543
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-9256543 2022-07-07 ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment Rashid, Mohammad Mamun Or Data Brief Data Article Elsevier 2022-06-24 /pmc/articles/PMC9256543/ /pubmed/35811647 http://dx.doi.org/10.1016/j.dib.2022.108416 Text en © 2022 The Author(s). Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title	ToxLex_bn: A curated dataset of bangla toxic language derived from Facebook comment
topic	Data Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9256543/ https://www.ncbi.nlm.nih.gov/pubmed/35811647 http://dx.doi.org/10.1016/j.dib.2022.108416
work_keys_str_mv	AT rashidmohammadmamunor toxlexbnacurateddatasetofbanglatoxiclanguagederivedfromfacebookcomment
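
The description above mentions extracting unique bigrams from comments and using a labelled bigram wordlist as classifier material. As a rough illustration only, a minimal Python sketch of those two steps might look like the following. It assumes plain whitespace tokenization, and the function names, the sample comments, and the label "class_1" are hypothetical placeholders; none of them come from ToxLex_bn or from the author's actual procedure.

```python
from collections import Counter


def bigrams(text):
    """Return adjacent word pairs from whitespace-tokenized text."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))


def count_bigrams(comments):
    """Count how often each bigram occurs across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(bigrams(comment))
    return counts


def match_lexicon(comment, lexicon):
    """Return the labels of every lexicon bigram found in the comment."""
    return [lexicon[pair] for pair in bigrams(comment) if pair in lexicon]


# Placeholder data (assumption): neutral tokens, not real dataset entries.
comments = ["alpha beta gamma", "beta gamma delta"]
counts = count_bigrams(comments)

# Hypothetical lexicon mapping a bigram to a thematic class label.
lexicon = {("beta", "gamma"): "class_1"}
labels = match_lexicon("x beta gamma y", lexicon)
```

Real Bangla text would likely need a language-aware tokenizer and spelling normalization before any lookup; this sketch leaves both out.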