SANAD: Single-label Arabic News Articles Dataset for automatic text categorization
Text Classification is one of the most popular Natural Language Processing (NLP) tasks. Text classification (aka categorization) is an active research topic in recent years. However, much less attention was directed towards this task in Arabic, due to the lack of rich representative resources for tr...
Main authors: | Einea, Omar; Elnagar, Ashraf; Al Debsi, Ridhwan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2019 |
Subjects: | Computer Science |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6700340/ https://www.ncbi.nlm.nih.gov/pubmed/31440535 http://dx.doi.org/10.1016/j.dib.2019.104076 |
_version_ | 1783444853056602112 |
---|---|
author | Einea, Omar Elnagar, Ashraf Al Debsi, Ridhwan |
collection | PubMed |
description | Text Classification is one of the most popular Natural Language Processing (NLP) tasks. Text classification (aka categorization) is an active research topic in recent years. However, much less attention was directed towards this task in Arabic, due to the lack of rich representative resources for training an Arabic text classifier. Therefore, we introduce a large Single-labeled Arabic News Articles Dataset (SANAD) of textual data collected from three news portals. The dataset is a large one consisting of almost 200k articles distributed into seven categories that we offer to the research community on Arabic computational linguistics. We anticipate that this rich dataset would make a great aid for a variety of NLP tasks on Modern Standard Arabic (MSA) textual data, especially for single label text classification purposes. We present the data in raw form. SANAD is composed of three main datasets scraped from three news portals, which are AlKhaleej, AlArabiya, and Akhbarona. SANAD is made public and freely available at https://data.mendeley.com/datasets/57zpx667y9. |
format | Online Article Text |
id | pubmed-6700340 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-67003402019-08-22 SANAD: Single-label Arabic News Articles Dataset for automatic text categorization Einea, Omar Elnagar, Ashraf Al Debsi, Ridhwan Data Brief Computer Science Text Classification is one of the most popular Natural Language Processing (NLP) tasks. Text classification (aka categorization) is an active research topic in recent years. However, much less attention was directed towards this task in Arabic, due to the lack of rich representative resources for training an Arabic text classifier. Therefore, we introduce a large Single-labeled Arabic News Articles Dataset (SANAD) of textual data collected from three news portals. The dataset is a large one consisting of almost 200k articles distributed into seven categories that we offer to the research community on Arabic computational linguistics. We anticipate that this rich dataset would make a great aid for a variety of NLP tasks on Modern Standard Arabic (MSA) textual data, especially for single label text classification purposes. We present the data in raw form. SANAD is composed of three main datasets scraped from three news portals, which are AlKhaleej, AlArabiya, and Akhbarona. SANAD is made public and freely available at https://data.mendeley.com/datasets/57zpx667y9. Elsevier 2019-06-04 /pmc/articles/PMC6700340/ /pubmed/31440535 http://dx.doi.org/10.1016/j.dib.2019.104076 Text en © 2019 The Author(s) http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). |
title | SANAD: Single-label Arabic News Articles Dataset for automatic text categorization |
topic | Computer Science |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6700340/ https://www.ncbi.nlm.nih.gov/pubmed/31440535 http://dx.doi.org/10.1016/j.dib.2019.104076 |