A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking

Grammar error correction can be considered as a “translation” problem, such that an erroneous sentence is “translated” into a correct version of the sentence in the same language. This can be accomplished by employing techniques like Statistical Machine Translation (SMT) or Neural Machine Translatio...

Bibliographic Details
Main authors: Madi, Nora; Al-Khalifa, Hend S.
Format: Online Article Text
Language: English
Published: Elsevier 2018
Subjects: Computer Science
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6305806/
https://www.ncbi.nlm.nih.gov/pubmed/30591941
http://dx.doi.org/10.1016/j.dib.2018.11.146
author Madi, Nora
Al-Khalifa, Hend S.
collection PubMed
description Grammar error correction can be framed as a “translation” problem in which an erroneous sentence is “translated” into a correct version of the sentence in the same language. This can be accomplished with techniques such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Building SMT or NMT models for grammar correction requires monolingual parallel corpora in the language concerned. This data article presents a monolingual parallel corpus of Arabic text called A7׳ta. It contains 470 erroneous sentences and their 470 error-free counterparts. The corpus can be used as a linguistic resource for Arabic natural language processing (NLP), mainly to train sequence-to-sequence models for grammar checking. Sentences were manually collected from a book prepared as a guide to correctly writing and using Arabic grammar and other linguistic features. Although a number of Arabic corpora of errors and corrections are available [2], such as QALB [10] and the Arabic Learner Corpus [11], the data presented in this article are an effort to increase the number of freely available Arabic corpora of errors and corrections by providing a detailed error specification and leveraging the work of language experts.
format Online
Article
Text
id pubmed-6305806
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-6305806 2018-12-27 A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking Madi, Nora Al-Khalifa, Hend S. Data Brief Computer Science Elsevier 2018-12-04 /pmc/articles/PMC6305806/ /pubmed/30591941 http://dx.doi.org/10.1016/j.dib.2018.11.146 Text en © 2018 The Authors. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking
topic Computer Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6305806/
https://www.ncbi.nlm.nih.gov/pubmed/30591941
http://dx.doi.org/10.1016/j.dib.2018.11.146