Directions in abusive language training data, a systematic review: Garbage in, garbage out

Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on...

Bibliographic Details
Main authors: Vidgen, Bertie; Derczynski, Leon
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7769249/
https://www.ncbi.nlm.nih.gov/pubmed/33370298
http://dx.doi.org/10.1371/journal.pone.0243300
author Vidgen, Bertie
Derczynski, Leon
author_facet Vidgen, Bertie
Derczynski, Leon
author_sort Vidgen, Bertie
collection PubMed
description Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on creation of a dedicated website for cataloguing abusive language data hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.
format Online
Article
Text
id pubmed-7769249
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-77692492021-01-08 Directions in abusive language training data, a systematic review: Garbage in, garbage out Vidgen, Bertie Derczynski, Leon PLoS One Research Article Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on creation of a dedicated website for cataloguing abusive language data hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets. Public Library of Science 2020-12-28 /pmc/articles/PMC7769249/ /pubmed/33370298 http://dx.doi.org/10.1371/journal.pone.0243300 Text en © 2020 Vidgen, Derczynski http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Vidgen, Bertie
Derczynski, Leon
Directions in abusive language training data, a systematic review: Garbage in, garbage out
title Directions in abusive language training data, a systematic review: Garbage in, garbage out
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7769249/
https://www.ncbi.nlm.nih.gov/pubmed/33370298
http://dx.doi.org/10.1371/journal.pone.0243300