SANAD: Single-label Arabic News Articles Dataset for automatic text categorization

Text classification is one of the most popular Natural Language Processing (NLP) tasks. Text classification (also known as categorization) has been an active research topic in recent years. However, much less attention has been directed towards this task in Arabic, due to the lack of rich representative resources for tr...

Bibliographic Details
Main Authors: Einea, Omar; Elnagar, Ashraf; Al Debsi, Ridhwan
Format: Online Article Text
Language: English
Published: Elsevier 2019
Subjects: Computer Science
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6700340/
https://www.ncbi.nlm.nih.gov/pubmed/31440535
http://dx.doi.org/10.1016/j.dib.2019.104076
author Einea, Omar
Elnagar, Ashraf
Al Debsi, Ridhwan
collection PubMed
description Text classification is one of the most popular Natural Language Processing (NLP) tasks. Text classification (also known as categorization) has been an active research topic in recent years. However, much less attention has been directed towards this task in Arabic, due to the lack of rich, representative resources for training an Arabic text classifier. Therefore, we introduce a large Single-labeled Arabic News Articles Dataset (SANAD) of textual data collected from three news portals. The dataset consists of almost 200k articles distributed across seven categories, and we offer it to the research community on Arabic computational linguistics. We anticipate that this rich dataset will be a great aid for a variety of NLP tasks on Modern Standard Arabic (MSA) textual data, especially for single-label text classification. We present the data in raw form. SANAD is composed of three main datasets scraped from three news portals: AlKhaleej, AlArabiya, and Akhbarona. SANAD is public and freely available at https://data.mendeley.com/datasets/57zpx667y9.
format Online
Article
Text
id pubmed-6700340
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-67003402019-08-22 SANAD: Single-label Arabic News Articles Dataset for automatic text categorization Einea, Omar Elnagar, Ashraf Al Debsi, Ridhwan Data Brief Computer Science Text Classification is one of the most popular Natural Language Processing (NLP) tasks. Text classification (aka categorization) is an active research topic in recent years. However, much less attention was directed towards this task in Arabic, due to the lack of rich representative resources for training an Arabic text classifier. Therefore, we introduce a large Single-labeled Arabic News Articles Dataset (SANAD) of textual data collected from three news portals. The dataset is a large one consisting of almost 200k articles distributed into seven categories that we offer to the research community on Arabic computational linguistics. We anticipate that this rich dataset would make a great aid for a variety of NLP tasks on Modern Standard Arabic (MSA) textual data, especially for single label text classification purposes. We present the data in raw form. SANAD is composed of three main datasets scraped from three news portals, which are AlKhaleej, AlArabiya, and Akhbarona. SANAD is made public and freely available at https://data.mendeley.com/datasets/57zpx667y9. Elsevier 2019-06-04 /pmc/articles/PMC6700340/ /pubmed/31440535 http://dx.doi.org/10.1016/j.dib.2019.104076 Text en © 2019 The Author(s) http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
topic Computer Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6700340/
https://www.ncbi.nlm.nih.gov/pubmed/31440535
http://dx.doi.org/10.1016/j.dib.2019.104076