Investigating cross-lingual training for offensive language detection
Main authors: | Pelicon, Andraž; Shekhar, Ravi; Škrlj, Blaž; Purver, Matthew; Pollak, Senja |
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc., 2021 |
Subjects: | Computational Linguistics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8237322/ https://www.ncbi.nlm.nih.gov/pubmed/34239970 http://dx.doi.org/10.7717/peerj-cs.559 |
_version_ | 1783714707945816064 |
author | Pelicon, Andraž Shekhar, Ravi Škrlj, Blaž Purver, Matthew Pollak, Senja |
author_facet | Pelicon, Andraž Shekhar, Ravi Škrlj, Blaž Purver, Matthew Pollak, Senja |
author_sort | Pelicon, Andraž |
collection | PubMed |
description | Platforms that feature user-generated content (social media, online forums, newspaper comment sections etc.) have to detect and filter offensive speech within large, fast-changing datasets. While many automatic methods have been proposed and achieve good accuracies, most of these focus on the English language, and are hard to apply directly to languages in which few labeled datasets exist. Recent work has therefore investigated the use of cross-lingual transfer learning to solve this problem, training a model in a well-resourced language and transferring to a less-resourced target language; but performance has so far been significantly less impressive. In this paper, we investigate the reasons for this performance drop, via a systematic comparison of pre-trained models and intermediate training regimes on five different languages. We show that using a better pre-trained language model results in a large gain in overall performance and in zero-shot transfer, and that intermediate training on other languages is effective when little target-language data is available. We then use multiple analyses of classifier confidence and language model vocabulary to shed light on exactly where these gains come from and gain insight into the sources of the most typical mistakes. |
format | Online Article Text |
id | pubmed-8237322 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-8237322 2021-07-07 Investigating cross-lingual training for offensive language detection Pelicon, Andraž Shekhar, Ravi Škrlj, Blaž Purver, Matthew Pollak, Senja PeerJ Comput Sci Computational Linguistics PeerJ Inc. 2021-06-25 /pmc/articles/PMC8237322/ /pubmed/34239970 http://dx.doi.org/10.7717/peerj-cs.559 Text en © 2021 Pelicon et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose, provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either the DOI or URL of the article must be cited. |
spellingShingle | Computational Linguistics Pelicon, Andraž Shekhar, Ravi Škrlj, Blaž Purver, Matthew Pollak, Senja Investigating cross-lingual training for offensive language detection |
title | Investigating cross-lingual training for offensive language detection |
title_full | Investigating cross-lingual training for offensive language detection |
title_fullStr | Investigating cross-lingual training for offensive language detection |
title_full_unstemmed | Investigating cross-lingual training for offensive language detection |
title_short | Investigating cross-lingual training for offensive language detection |
title_sort | investigating cross-lingual training for offensive language detection |
topic | Computational Linguistics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8237322/ https://www.ncbi.nlm.nih.gov/pubmed/34239970 http://dx.doi.org/10.7717/peerj-cs.559 |
work_keys_str_mv | AT peliconandraz investigatingcrosslingualtrainingforoffensivelanguagedetection AT shekharravi investigatingcrosslingualtrainingforoffensivelanguagedetection AT skrljblaz investigatingcrosslingualtrainingforoffensivelanguagedetection AT purvermatthew investigatingcrosslingualtrainingforoffensivelanguagedetection AT pollaksenja investigatingcrosslingualtrainingforoffensivelanguagedetection |