DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube commen...
Main authors: | Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer Netherlands, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449/ https://www.ncbi.nlm.nih.gov/pubmed/35996566 http://dx.doi.org/10.1007/s10579-022-09583-7 |
_version_ | 1784770228450754560 |
---|---|
author | Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P. |
author_facet | Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P. |
author_sort | Chakravarthi, Bharathi Raja |
collection | PubMed |
description | This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement in terms of Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo. |
format | Online Article Text |
id | pubmed-9388449 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer Netherlands |
record_format | MEDLINE/PubMed |
spelling | pubmed-9388449 2022-08-20 DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P. Lang Resour Eval Original Paper This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement in terms of Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo. Springer Netherlands 2022-02-04 2022 /pmc/articles/PMC9388449/ /pubmed/35996566 http://dx.doi.org/10.1007/s10579-022-09583-7 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Original Paper Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P. DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text |
title | DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text |
title_full | DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text |
title_fullStr | DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text |
title_full_unstemmed | DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text |
title_short | DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text |
title_sort | dravidiancodemix: sentiment analysis and offensive language identification dataset for dravidian languages in code-mixed text |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449/ https://www.ncbi.nlm.nih.gov/pubmed/35996566 http://dx.doi.org/10.1007/s10579-022-09583-7 |