
Investigating cross-lingual training for offensive language detection

Platforms that feature user-generated content (social media, online forums, newspaper comment sections etc.) have to detect and filter offensive speech within large, fast-changing datasets. While many automatic methods have been proposed and achieve good accuracies, most of these focus on the English language, and are hard to apply directly to languages in which few labeled datasets exist. Recent work has therefore investigated the use of cross-lingual transfer learning to solve this problem, training a model in a well-resourced language and transferring to a less-resourced target language; but performance has so far been significantly less impressive. In this paper, we investigate the reasons for this performance drop, via a systematic comparison of pre-trained models and intermediate training regimes on five different languages. We show that using a better pre-trained language model results in a large gain in overall performance and in zero-shot transfer, and that intermediate training on other languages is effective when little target-language data is available. We then use multiple analyses of classifier confidence and language model vocabulary to shed light on exactly where these gains come from and gain insight into the sources of the most typical mistakes.

Bibliographic Details
Main Authors: Pelicon, Andraž, Shekhar, Ravi, Škrlj, Blaž, Purver, Matthew, Pollak, Senja
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Computational Linguistics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8237322/
https://www.ncbi.nlm.nih.gov/pubmed/34239970
http://dx.doi.org/10.7717/peerj-cs.559
Published in PeerJ Comput Sci (Computational Linguistics), 2021-06-25.

© 2021 Pelicon et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose, provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.