
Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems

Arabic diacritics are often omitted in Arabic script. This omission is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this issue. But such automation nee...


Bibliographic Details
Main Authors: Zerrouki, Taha, Balla, Amar
Format: Online Article Text
Language: English
Published: Elsevier 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5310197/
https://www.ncbi.nlm.nih.gov/pubmed/28224131
http://dx.doi.org/10.1016/j.dib.2017.01.011
_version_ 1782507834033831936
author Zerrouki, Taha
Balla, Amar
author_facet Zerrouki, Taha
Balla, Amar
author_sort Zerrouki, Taha
collection PubMed
description Arabic diacritics are often omitted in Arabic script. This omission is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this issue, but such automation needs resources, namely diacritized texts, to train and evaluate these systems. In this paper, we describe our corpus of Arabic diacritized texts, called Tashkeela. It can be used as a linguistic resource for natural language processing tasks such as automatic diacritization, disambiguation, and feature and data extraction. The corpus is freely available; it contains 75 million fully vocalized words, drawn mainly from 97 books in classical and modern Arabic. The corpus was collected from manually vocalized texts using a web crawling process.
format Online
Article
Text
id pubmed-5310197
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-53101972017-02-21 Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems Zerrouki, Taha Balla, Amar Data Brief Data Article Arabic diacritics are often omitted in Arabic script. This omission is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to this issue, but such automation needs resources, namely diacritized texts, to train and evaluate these systems. In this paper, we describe our corpus of Arabic diacritized texts, called Tashkeela. It can be used as a linguistic resource for natural language processing tasks such as automatic diacritization, disambiguation, and feature and data extraction. The corpus is freely available; it contains 75 million fully vocalized words, drawn mainly from 97 books in classical and modern Arabic. The corpus was collected from manually vocalized texts using a web crawling process. Elsevier 2017-02-03 /pmc/articles/PMC5310197/ /pubmed/28224131 http://dx.doi.org/10.1016/j.dib.2017.01.011 Text en © 2017 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems
topic Data Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5310197/
https://www.ncbi.nlm.nih.gov/pubmed/28224131
http://dx.doi.org/10.1016/j.dib.2017.01.011