The design, construction and evaluation of annotated Arabic cyberbullying corpus
Cyberbullying (CB) is classified as one of the severe misconducts on social media. Many CB detection systems have been developed for many natural languages to face this phenomenon. However, Arabic is one of the under-resourced languages suffering from the lack of quality datasets in many computation...
Main authors: | Shannag, Fatima; Hammo, Bassam H.; Faris, Hossam |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer US, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9046013/ https://www.ncbi.nlm.nih.gov/pubmed/35502160 http://dx.doi.org/10.1007/s10639-022-11056-x |
_version_ | 1784695433513140224 |
---|---|
author | Shannag, Fatima; Hammo, Bassam H.; Faris, Hossam |
author_facet | Shannag, Fatima; Hammo, Bassam H.; Faris, Hossam |
author_sort | Shannag, Fatima |
collection | PubMed |
description | Cyberbullying (CB) is classified as one of the severe misconducts on social media. Many CB detection systems have been developed for many natural languages to face this phenomenon. However, Arabic is one of the under-resourced languages suffering from the lack of quality datasets in many computational research areas. This paper discusses the design, construction, and evaluation of a multi-dialect, annotated Arabic Cyberbullying Corpus (ArCybC), a valuable resource for Arabic CB detection and motivation for future research directions in Arabic Natural Language Processing (NLP). The study describes the phases of ArCybC compilation. By way of illustration, it explores the corpus to discover strategies used in rendering Arabic CB tweets pulled from four Twitter groups, including gaming, sports, news, and celebrities. Based on thorough analysis, we discovered that these groups were the most susceptible to harassment and cyberbullying. The collected tweets were filtered based on a compiled harassment lexicon, which contains a list of multi-dialectical profane words in Arabic compiled from four categories: sexual, racial, physical appearance, and intelligence. To annotate ArCybC, we asked five annotators to classify 4,505 tweets into two classes manually: Offensive/non-Offensive and CB/non-CB. We conducted a rigorous comparison of different machine learning approaches applied on ArCybC to detect Arabic CB using two language models: bag-of-words (BoW) and word embedding. The experiments showed that Support Vector Machine (SVM) with word embedding achieved an accuracy rate of 86.3% and an F1-score rate of 85%. The main challenges encountered during the ArCybC construction were the scarcity of freely available Arabic CB texts and the deficiency of annotating the texts. |
format | Online Article Text |
id | pubmed-9046013 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer US |
record_format | MEDLINE/PubMed |
spelling | pubmed-9046013 2022-04-28 The design, construction and evaluation of annotated Arabic cyberbullying corpus Shannag, Fatima Hammo, Bassam H. Faris, Hossam Educ Inf Technol (Dordr) Article Springer US 2022-04-28 2022 /pmc/articles/PMC9046013/ /pubmed/35502160 http://dx.doi.org/10.1007/s10639-022-11056-x Text en © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Shannag, Fatima Hammo, Bassam H. Faris, Hossam The design, construction and evaluation of annotated Arabic cyberbullying corpus |
title | The design, construction and evaluation of annotated Arabic cyberbullying corpus |
title_sort | design, construction and evaluation of annotated arabic cyberbullying corpus |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9046013/ https://www.ncbi.nlm.nih.gov/pubmed/35502160 http://dx.doi.org/10.1007/s10639-022-11056-x |