A curated dataset for hate speech detection on social media text

Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. An extensive dataset containing emoticons, emojis, hashtags, slang, and contractions is required to detect hate speech on social media based on current trends. Therefore...

Bibliographic Details
Main Authors: Mody, Devansh; Huang, YiDong; Alves de Oliveira, Thiago Eustaquio
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9807815/
https://www.ncbi.nlm.nih.gov/pubmed/36605500
http://dx.doi.org/10.1016/j.dib.2022.108832
author Mody, Devansh
Huang, YiDong
Alves de Oliveira, Thiago Eustaquio
collection PubMed
description Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. Detecting hate speech on social media in line with current trends requires an extensive dataset containing emoticons, emojis, hashtags, slang, and contractions. Our dataset is therefore curated from various sources such as Kaggle, GitHub, and other websites. It contains English sentences divided into two classes, one representing hateful content and the other non-hateful content. It has 451,709 sentences in total. 371,452 of these are hate speech, and 80,250 are non-hate speech. An augmented, balanced dataset with 726,120 samples is also generated to create a custom vocabulary of 145,046 words. The dataset covers 6,403 contractions and 377 bad words commonly used in hateful content. The text of each sentence in the final dataset, which is used for training and cross-validation, is limited to 180 words. The generated contractions dataset can be used for data preprocessing in any NLP project. The augmented dataset can help reduce the number of out-of-vocabulary words, and the hate speech dataset can be used to train a classifier that detects hateful versus non-hateful content on social media platforms.
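The two preprocessing steps the description mentions, expanding contractions and capping each sentence at 180 words, could look roughly like the sketch below. It is not the authors' code: the `CONTRACTIONS` mapping holds three illustrative entries rather than the dataset's full contraction list, and the function names `expand_contractions`, `truncate_words`, and `preprocess` are hypothetical.

```python
import re

# Illustrative entries only (assumption); the published contractions
# dataset is far larger than this toy mapping.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
}

MAX_WORDS = 180  # per-sentence word limit stated in the description

# Match any known contraction as a whole token, case-insensitively.
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")(?!\w)",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions with their (lowercase) expanded forms."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1).lower()], text)


def truncate_words(text: str, limit: int = MAX_WORDS) -> str:
    """Keep at most `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def preprocess(text: str) -> str:
    """Expand contractions first, then enforce the word limit."""
    return truncate_words(expand_contractions(text))
```

Expanding before truncating matters here: an expansion can add words, so applying the limit last keeps every output within 180 words.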
format Online
Article
Text
id pubmed-9807815
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-9807815 2023-01-04 A curated dataset for hate speech detection on social media text Mody, Devansh Huang, YiDong Alves de Oliveira, Thiago Eustaquio Data Brief Data Article Elsevier 2022-12-17 /pmc/articles/PMC9807815/ /pubmed/36605500 http://dx.doi.org/10.1016/j.dib.2022.108832 Text en © 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title A curated dataset for hate speech detection on social media text
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9807815/
https://www.ncbi.nlm.nih.gov/pubmed/36605500
http://dx.doi.org/10.1016/j.dib.2022.108832