Multi-label emotion classification of Urdu tweets

Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to...

Bibliographic Details
Main authors: Ashraf, Noman; Khan, Lal; Butt, Sabur; Chang, Hsien-Tsung; Sidorov, Grigori; Gelbukh, Alexander
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Computational Linguistics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044368/
https://www.ncbi.nlm.nih.gov/pubmed/35494831
http://dx.doi.org/10.7717/peerj-cs.896
author Ashraf, Noman
Khan, Lal
Butt, Sabur
Chang, Hsien-Tsung
Sidorov, Grigori
Gelbukh, Alexander
author_sort Ashraf, Noman
collection PubMed
description Urdu is widely used in South Asia and worldwide. While similar datasets are available in English, we created the first multi-label emotion dataset in the Urdu Nastalíq script, consisting of 6,043 tweets covering six basic emotions. A multi-label (ML) classification approach was adopted to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection a challenging problem. In this paper, we build a set of baseline classifiers comprising machine learning algorithms (Random forest (RF), Decision tree (J48), Sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (Convolutional Neural Networks (1D-CNN), Long short-term memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper highlights the annotation guidelines, dataset characteristics, and insights into the different methodologies used for Urdu emotion classification. We present our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
format Online
Article
Text
id pubmed-9044368
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-9044368 2022-04-28 Multi-label emotion classification of Urdu tweets PeerJ Comput Sci Computational Linguistics PeerJ Inc. 2022-04-22 /pmc/articles/PMC9044368/ /pubmed/35494831 http://dx.doi.org/10.7717/peerj-cs.896 Text en © 2022 Ashraf et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Multi-label emotion classification of Urdu tweets
topic Computational Linguistics