AraCust: a Saudi Telecom Tweets corpus for sentiment analysis

Comparing Arabic to other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars depended on translation from one language to another to construct their corpus (Rushdi-Saleh et al., 2011). This paper prese...

Bibliographic Details
Main Authors: Almuqren, Latifah; Cristea, Alexandra
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2021
Subjects: Data Mining and Machine Learning
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8157250/
https://www.ncbi.nlm.nih.gov/pubmed/34084924
http://dx.doi.org/10.7717/peerj-cs.510
collection PubMed
description Compared with other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars have relied on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper presents how we constructed, cleaned, pre-processed, and annotated our 20,000-tweet Gold Standard Corpus (GSC), AraCust, the first Telecom GSC for Arabic Sentiment Analysis (ASA) of Dialectal Arabic (DA). AraCust contains Saudi dialect tweets, processed from a self-collected Arabic tweets dataset, and has been annotated for sentiment analysis, i.e., manually labelled (k = 0.60). In addition, we illustrate AraCust's value through an exploratory data analysis of the features that arise from the nature of the corpus, to assist with choosing suitable ASA methods for it. To evaluate our Gold Standard Corpus AraCust, we first ran a simple experiment with a supervised classifier, to offer benchmark results for future work. We then applied the same supervised classifier to a publicly available Arabic dataset created from Twitter, ASTD (Nabil, Aly & Atiya, 2015). The results show that the classifier performs better on AraCust than on ASTD, reaching 91% accuracy and an F1avg score of 89% on AraCust. The AraCust corpus will be released, together with code useful for its exploration, via GitHub as a part of this submission.
id pubmed-8157250
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-8157250 2021-06-02 AraCust: a Saudi Telecom Tweets corpus for sentiment analysis. Almuqren, Latifah; Cristea, Alexandra. PeerJ Comput Sci. Data Mining and Machine Learning. PeerJ Inc. 2021-05-20. /pmc/articles/PMC8157250/ /pubmed/34084924 http://dx.doi.org/10.7717/peerj-cs.510 Text en ©2021 Almuqren and Cristea. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
topic Data Mining and Machine Learning