Multi-label emotion classification of Urdu tweets

Urdu is a widely used language in South Asia and worldwide. While there are similar datasets available in English, we created the first multi-label emotion dataset consisting of 6,043 tweets and six basic emotions in the Urdu Nastalíq script. A multi-label (ML) classification approach was adopted to...

Bibliographic Details
Main Authors:	Ashraf, Noman; Khan, Lal; Butt, Sabur; Chang, Hsien-Tsung; Sidorov, Grigori; Gelbukh, Alexander
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2022
Subjects:	Computational Linguistics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9044368/ https://www.ncbi.nlm.nih.gov/pubmed/35494831 http://dx.doi.org/10.7717/peerj-cs.896

collection	PubMed
description	Urdu is widely spoken in South Asia and around the world. Although comparable datasets exist for English, we created the first multi-label emotion dataset for Urdu, consisting of 6,043 tweets in the Urdu Nastalíq script annotated with six basic emotions. We adopted a multi-label (ML) classification approach to detect emotions in Urdu text. The morphological and syntactic structure of Urdu makes multi-label emotion detection challenging. In this paper, we build a set of baseline classifiers: machine learning algorithms (Random forest (RF), Decision tree (J48), Sequential minimal optimization (SMO), AdaBoostM1, and Bagging), deep-learning algorithms (Convolutional Neural Networks (1D-CNN), Long short-term memory (LSTM), and LSTM with CNN features), and a transformer-based baseline (BERT). We used a combination of text representations: stylometric features, pre-trained word embeddings, word-based n-grams, and character-based n-grams. The paper describes the annotation guidelines, the dataset characteristics, and insights into the different methodologies used for Urdu emotion classification. We report our best results using micro-averaged F1, macro-averaged F1, accuracy, Hamming loss (HL), and exact match (EM) for all tested methods.
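The evaluation metrics named in the description can be illustrated with a minimal sketch. This is not the authors' evaluation code: the emotion label names, the toy label matrices, and all function names below are assumptions made for illustration, and the numbers are not results from the paper. Each tweet is represented as a binary vector with one slot per emotion.

```python
from typing import List, Tuple

# Assumed label names; the record only says "six basic emotions".
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

Matrix = List[List[int]]


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def exact_match(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of tweets whose whole label vector is predicted correctly."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Matrix, y_pred: Matrix) -> Tuple[float, float]:
    """Micro F1 pools counts over all labels; macro F1 averages per-label F1."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    macro = sum(_f1(*c) for c in counts) / len(counts)
    micro = _f1(*(sum(col) for col in zip(*counts)))
    return micro, macro


# Toy data (hypothetical): three tweets, six emotion slots each.
Y_TRUE = [[1, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0], [0, 1, 1, 0, 0, 0]]
Y_PRED = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1]]

if __name__ == "__main__":
    micro, macro = micro_macro_f1(Y_TRUE, Y_PRED)
    print(f"HL={hamming_loss(Y_TRUE, Y_PRED):.3f} "
          f"EM={exact_match(Y_TRUE, Y_PRED):.3f} "
          f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

On the toy data, Hamming loss counts three wrong slots out of eighteen, while exact match credits only the one tweet whose full label set is right, which is why the two metrics can diverge sharply on multi-label data.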
id	pubmed-9044368
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-9044368 2022-04-28. PeerJ Comput Sci. Computational Linguistics. PeerJ Inc. 2022-04-22. /pmc/articles/PMC9044368/ /pubmed/35494831 http://dx.doi.org/10.7717/peerj-cs.896 Text en. © 2022 Ashraf et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
