DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube commen...

Bibliographic Details
Main authors: Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P.
Format: Online Article Text
Language: English
Published: Springer Netherlands 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449/
https://www.ncbi.nlm.nih.gov/pubmed/35996566
http://dx.doi.org/10.1007/s10579-022-09583-7
_version_ 1784770228450754560
author Chakravarthi, Bharathi Raja
Priyadharshini, Ruba
Muralidaran, Vigneshwaran
Jose, Navya
Suryawanshi, Shardul
Sherly, Elizabeth
McCrae, John P.
author_sort Chakravarthi, Bharathi Raja
collection PubMed
description This paper describes the development of a multilingual, manually annotated dataset, built from social media comments, for three under-resourced Dravidian languages. The dataset was annotated for sentiment analysis and offensive language identification and covers more than 60,000 YouTube comments in total: around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
format Online
Article
Text
id pubmed-9388449
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer Netherlands
record_format MEDLINE/PubMed
spelling pubmed-9388449
2022-08-20
DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
Chakravarthi, Bharathi Raja
Priyadharshini, Ruba
Muralidaran, Vigneshwaran
Jose, Navya
Suryawanshi, Shardul
Sherly, Elizabeth
McCrae, John P.
Lang Resour Eval
Original Paper
This paper describes the development of a multilingual, manually annotated dataset, built from social media comments, for three under-resourced Dravidian languages. The dataset was annotated for sentiment analysis and offensive language identification and covers more than 60,000 YouTube comments in total: around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
Springer Netherlands
2022-02-04
2022
/pmc/articles/PMC9388449/
/pubmed/35996566
http://dx.doi.org/10.1007/s10579-022-09583-7
Text
en
© The Author(s) 2022
https://creativecommons.org/licenses/by/4.0/
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449/
https://www.ncbi.nlm.nih.gov/pubmed/35996566
http://dx.doi.org/10.1007/s10579-022-09583-7