IADD: An integrated Arabic dialect identification dataset

The Arabic language has several variants that can be roughly grouped into three main categories: Classical Arabic (CA), Modern Standard Arabic (MSA) and Dialectal Arabic (DA). There are subtle differences between MSA and CA in terms of syntax, terminology and pronunciation. However, Dialectal Arab...

Bibliographic Details
Main Author: Zahir, Jihad
Format: Online Article Text
Language: English
Published: Elsevier 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8741435/
https://www.ncbi.nlm.nih.gov/pubmed/35028349
http://dx.doi.org/10.1016/j.dib.2021.107777
_version_ 1784629489587716096
author Zahir, Jihad
author_facet Zahir, Jihad
author_sort Zahir, Jihad
collection PubMed
description The Arabic language has several variants that can be roughly grouped into three main categories: Classical Arabic (CA), Modern Standard Arabic (MSA) and Dialectal Arabic (DA). There are subtle differences between MSA and CA in terms of syntax, terminology and pronunciation. However, Dialectal Arabic (DA) differs significantly from CA and MSA in that it reflects the geographic location of the speaker, or at least the country of origin if mobility factors are taken into account. This paper presents IADD, an integrated dataset for Arabic dialect identification that contains [Formula: see text] texts representing Arabic dialects from 5 regions and 9 countries. The IADD dataset was created by combining subsets of five corpora to support the task of automatic Arabic dialect detection.
format Online
Article
Text
id pubmed-8741435
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-8741435 2022-01-12 IADD: An integrated Arabic dialect identification dataset Zahir, Jihad Data Brief Data Article The Arabic language has several variants that can be roughly grouped into three main categories: Classical Arabic (CA), Modern Standard Arabic (MSA) and Dialectal Arabic (DA). There are subtle differences between MSA and CA in terms of syntax, terminology and pronunciation. However, Dialectal Arabic (DA) differs significantly from CA and MSA in that it reflects the geographic location of the speaker, or at least the country of origin if mobility factors are taken into account. This paper presents IADD, an integrated dataset for Arabic dialect identification that contains [Formula: see text] texts representing Arabic dialects from 5 regions and 9 countries. The IADD dataset was created by combining subsets of five corpora to support the task of automatic Arabic dialect detection. Elsevier 2021-12-30 /pmc/articles/PMC8741435/ /pubmed/35028349 http://dx.doi.org/10.1016/j.dib.2021.107777 Text en © 2021 The Author. Published by Elsevier Inc. https://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Data Article
Zahir, Jihad
IADD: An integrated Arabic dialect identification dataset
title IADD: An integrated Arabic dialect identification dataset
title_full IADD: An integrated Arabic dialect identification dataset
title_fullStr IADD: An integrated Arabic dialect identification dataset
title_full_unstemmed IADD: An integrated Arabic dialect identification dataset
title_short IADD: An integrated Arabic dialect identification dataset
title_sort iadd: an integrated arabic dialect identification dataset
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8741435/
https://www.ncbi.nlm.nih.gov/pubmed/35028349
http://dx.doi.org/10.1016/j.dib.2021.107777
work_keys_str_mv AT zahirjihad iaddanintegratedarabicdialectidentificationdataset