
A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking

Grammar error correction can be considered as a “translation” problem, such that an erroneous sentence is “translated” into a correct version of the sentence in the same language. This can be accomplished by employing techniques like Statistical Machine Translation (SMT) or Neural Machine Translatio...


Bibliographic Details
Main Authors:	Madi, Nora; Al-Khalifa, Hend S.
Format:	Online Article Text
Language:	English
Published:	Elsevier 2018
Subjects:	Computer Science
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6305806/ https://www.ncbi.nlm.nih.gov/pubmed/30591941 http://dx.doi.org/10.1016/j.dib.2018.11.146

_version_	1783382647836246016
author	Madi, Nora; Al-Khalifa, Hend S.
collection	PubMed
description	Grammar error correction can be framed as a “translation” problem in which an erroneous sentence is “translated” into a correct version of the same sentence in the same language. This can be accomplished with techniques such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Training SMT or NMT models for grammar correction requires monolingual parallel corpora in the target language. This data article presents a monolingual parallel corpus of Arabic text called A7׳ta ([Image: see text]). It contains 470 erroneous sentences and their 470 error-free counterparts. The corpus can be used as a linguistic resource for Arabic natural language processing (NLP), mainly to train sequence-to-sequence models for grammar checking. Sentences were manually collected from a book prepared as a guide to writing and using Arabic grammar and other linguistic features correctly. Although a number of Arabic corpora of errors and corrections are available [2], such as QALB [10] and the Arabic Learner Corpus [11], the data presented in this article aim to increase the number of freely available Arabic corpora of errors and corrections by providing a detailed error specification and leveraging the work of language experts.
format	Online Article Text
id	pubmed-6305806
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Elsevier
record_format	MEDLINE/PubMed
spelling	pubmed-6305806 2018-12-27 A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking Madi, Nora; Al-Khalifa, Hend S. Data Brief Computer Science Elsevier 2018-12-04 /pmc/articles/PMC6305806/ /pubmed/30591941 http://dx.doi.org/10.1016/j.dib.2018.11.146 Text en © 2018 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
title	A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking
topic	Computer Science
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6305806/ https://www.ncbi.nlm.nih.gov/pubmed/30591941 http://dx.doi.org/10.1016/j.dib.2018.11.146
