Curating Cyberbullying Datasets: a Human-AI Collaborative Approach

Bibliographic Details
Main Authors:	Gomez, Christopher E., Sztainberg, Marcelo O., Trana, Rachel E.
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2021
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8691962/ https://www.ncbi.nlm.nih.gov/pubmed/34957375 http://dx.doi.org/10.1007/s42380-021-00114-6

author	Gomez, Christopher E. Sztainberg, Marcelo O. Trana, Rachel E.
collection	PubMed
description	Cyberbullying is the use of digital communication tools and spaces to inflict physical, mental, or emotional distress. This serious form of aggression is frequently targeted at, but not limited to, vulnerable populations. A common problem when creating machine learning models to identify cyberbullying is the availability of accurately annotated, reliable, relevant, and diverse datasets. Datasets intended to train models for cyberbullying detection are typically annotated by human participants, which can introduce the following issues: (1) annotator bias, (2) incorrect annotation due to language and cultural barriers, and (3) the inherent subjectivity of the task can naturally create multiple valid labels for a given comment. The result can be a potentially inadequate dataset with one or more of these overlapping issues. We propose two machine learning approaches to identify and filter unambiguous comments in a cyberbullying dataset of roughly 19,000 comments collected from YouTube that was initially annotated using Amazon Mechanical Turk (AMT). Using consensus filtering methods, comments were classified as unambiguous when an agreement occurred between the AMT workers’ majority label and the unanimous algorithmic filtering label. Comments identified as unambiguous were extracted and used to curate new datasets. We then used an artificial neural network to test for performance on these datasets. Compared to the original dataset, the classifier exhibits a large improvement in performance on modified versions of the dataset and can yield insight into the type of data that is consistently classified as bullying or non-bullying. This annotation approach can be expanded from cyberbullying datasets onto any classification corpus that has a similar complexity in scope.
format	Online Article Text
id	pubmed-8691962
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-8691962 2021-12-22 Curating Cyberbullying Datasets: a Human-AI Collaborative Approach Gomez, Christopher E. Sztainberg, Marcelo O. Trana, Rachel E. Int J Bullying Prev Original Article Springer International Publishing 2021-12-22 2022 /pmc/articles/PMC8691962/ /pubmed/34957375 http://dx.doi.org/10.1007/s42380-021-00114-6 Text en © The Author(s), under exclusive licence to Springer Nature Switzerland AG 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	Curating Cyberbullying Datasets: a Human-AI Collaborative Approach
topic	Original Article
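
The consensus-filtering rule summarized in the description field (a comment counts as unambiguous when the AMT workers' majority label agrees with the unanimous algorithmic label) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the record keys `text`, `human`, and `model`, the function names, and the strict-majority handling of ties are all assumptions, since the record does not specify them.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_label(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the label chosen by a strict majority, or None on a tie or empty input.

    Treating ties as "no majority" is an assumption of this sketch.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def unanimous_label(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the shared label if every entry agrees, otherwise None."""
    if labels and all(item == labels[0] for item in labels):
        return labels[0]
    return None


def is_unambiguous(human_labels: Sequence[Hashable],
                   model_labels: Sequence[Hashable]) -> bool:
    """True when the human majority label equals the unanimous model label."""
    human = majority_label(human_labels)
    model = unanimous_label(model_labels)
    return human is not None and human == model


def curate(records: Sequence[dict]) -> list:
    """Keep only records whose labels pass the consensus check.

    Each record is a hypothetical dict with "human" and "model" label lists.
    """
    return [r for r in records if is_unambiguous(r["human"], r["model"])]
```

For instance, a record with human labels `["bully", "bully", "not"]` and model labels `["bully", "bully"]` would be kept, while one where the two models disagree, or where the models agree with each other but not with the human majority, would be dropped from the curated set.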
