Multi-label emotion classification of Urdu tweets
Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to...
Main Authors: | Ashraf, Noman; Khan, Lal; Butt, Sabur; Chang, Hsien-Tsung; Sidorov, Grigori; Gelbukh, Alexander |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc. 2022 |
Subjects: | Computational Linguistics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044368/ https://www.ncbi.nlm.nih.gov/pubmed/35494831 http://dx.doi.org/10.7717/peerj-cs.896 |
_version_ | 1784695090286952448 |
---|---|
author | Ashraf, Noman Khan, Lal Butt, Sabur Chang, Hsien-Tsung Sidorov, Grigori Gelbukh, Alexander |
author_facet | Ashraf, Noman Khan, Lal Butt, Sabur Chang, Hsien-Tsung Sidorov, Grigori Gelbukh, Alexander |
author_sort | Ashraf, Noman |
collection | PubMed |
description | Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to detect emotions from Urdu. The morphological and syntactic structure of Urdu makes it a challenging problem for multi-label emotion detection. In this paper, we build a set of baseline classifiers such as machine learning algorithms (Random forest (RF), Decision tree (J48), Sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (Convolutional Neural Networks (1D-CNN), Long short-term memory (LSTM), and LSTM with CNN features) and transformer-based baseline (BERT). We used a combination of text representations: stylometric-based features, pre-trained word embedding, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, dataset characteristics and insights into different methodologies used for Urdu based emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL) and exact match (EM) for all tested methods. |
format | Online Article Text |
id | pubmed-9044368 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9044368 2022-04-28 Multi-label emotion classification of Urdu tweets Ashraf, Noman Khan, Lal Butt, Sabur Chang, Hsien-Tsung Sidorov, Grigori Gelbukh, Alexander PeerJ Comput Sci Computational Linguistics Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to detect emotions from Urdu. The morphological and syntactic structure of Urdu makes it a challenging problem for multi-label emotion detection. In this paper, we build a set of baseline classifiers such as machine learning algorithms (Random forest (RF), Decision tree (J48), Sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (Convolutional Neural Networks (1D-CNN), Long short-term memory (LSTM), and LSTM with CNN features) and transformer-based baseline (BERT). We used a combination of text representations: stylometric-based features, pre-trained word embedding, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, dataset characteristics and insights into different methodologies used for Urdu based emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL) and exact match (EM) for all tested methods. PeerJ Inc. 2022-04-22 /pmc/articles/PMC9044368/ /pubmed/35494831 http://dx.doi.org/10.7717/peerj-cs.896 Text en © 2022 Ashraf et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited. |
spellingShingle | Computational Linguistics Ashraf, Noman Khan, Lal Butt, Sabur Chang, Hsien-Tsung Sidorov, Grigori Gelbukh, Alexander Multi-label emotion classification of Urdu tweets |
title | Multi-label emotion classification of Urdu tweets |
title_full | Multi-label emotion classification of Urdu tweets |
title_fullStr | Multi-label emotion classification of Urdu tweets |
title_full_unstemmed | Multi-label emotion classification of Urdu tweets |
title_short | Multi-label emotion classification of Urdu tweets |
title_sort | multi-label emotion classification of urdu tweets |
topic | Computational Linguistics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044368/ https://www.ncbi.nlm.nih.gov/pubmed/35494831 http://dx.doi.org/10.7717/peerj-cs.896 |
work_keys_str_mv | AT ashrafnoman multilabelemotionclassificationofurdutweets AT khanlal multilabelemotionclassificationofurdutweets AT buttsabur multilabelemotionclassificationofurdutweets AT changhsientsung multilabelemotionclassificationofurdutweets AT sidorovgrigori multilabelemotionclassificationofurdutweets AT gelbukhalexander multilabelemotionclassificationofurdutweets |