
DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube commen...

Bibliographic Details
Main authors:	Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P.
Format:	Online Article Text
Language:	English
Published:	Springer Netherlands 2022
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449/ https://www.ncbi.nlm.nih.gov/pubmed/35996566 http://dx.doi.org/10.1007/s10579-022-09583-7

_version_	1784770228450754560
author	Chakravarthi, Bharathi Raja Priyadharshini, Ruba Muralidaran, Vigneshwaran Jose, Navya Suryawanshi, Shardul Sherly, Elizabeth McCrae, John P.
author_facet	Chakravarthi, Bharathi Raja Priyadharshini, Ruba Muralidaran, Vigneshwaran Jose, Navya Suryawanshi, Shardul Sherly, Elizabeth McCrae, John P.
author_sort	Chakravarthi, Bharathi Raja
collection	PubMed
description	This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
format	Online Article Text
id	pubmed-9388449
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer Netherlands
record_format	MEDLINE/PubMed
spelling	pubmed-9388449 2022-08-20 DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P. Lang Resour Eval Original Paper This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo. Springer Netherlands 2022-02-04 2022 /pmc/articles/PMC9388449/ /pubmed/35996566 http://dx.doi.org/10.1007/s10579-022-09583-7 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
spellingShingle	Original Paper Chakravarthi, Bharathi Raja Priyadharshini, Ruba Muralidaran, Vigneshwaran Jose, Navya Suryawanshi, Shardul Sherly, Elizabeth McCrae, John P. DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
title	DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
title_full	DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
title_fullStr	DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
title_full_unstemmed	DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
title_short	DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text
title_sort	dravidiancodemix: sentiment analysis and offensive language identification dataset for dravidian languages in code-mixed text
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9388449/ https://www.ncbi.nlm.nih.gov/pubmed/35996566 http://dx.doi.org/10.1007/s10579-022-09583-7
