Multi-components System for Automatic Arabic Diacritization
In this paper, we propose an approach to tackle the problem of the automatic restoration of Arabic diacritics that includes three components stacked in a pipeline: a deep learning model which is a multi-layer recurrent neural network with LSTM and Dense layers, a character-level rule-based corrector...
Main Authors: | Abbad, Hamza; Xiong, Shengwu |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148237/ http://dx.doi.org/10.1007/978-3-030-45439-5_23 |
_version_ | 1783520550504628224 |
---|---|
author | Abbad, Hamza Xiong, Shengwu |
author_facet | Abbad, Hamza Xiong, Shengwu |
author_sort | Abbad, Hamza |
collection | PubMed |
description | In this paper, we propose an approach to the automatic restoration of Arabic diacritics that stacks three components in a pipeline: a deep learning model, which is a multi-layer recurrent neural network with LSTM and Dense layers; a character-level rule-based corrector, which applies deterministic operations to prevent some errors; and a word-level statistical corrector, which uses context and distance information to fix some diacritization issues. The approach is novel in that it combines methods of different types and adds edit-distance-based corrections. We used a large public dataset of raw diacritized Arabic text (Tashkeela) for training and testing our system after cleaning and normalizing it. On a newly released benchmark test set, our system outperformed all the tested systems, achieving a diacritic error rate (DER) of 3.39% and a word error rate (WER) of 9.94% when taking all Arabic letters into account, and a DER of 2.61% and a WER of 5.83% when ignoring the diacritization of the last letter of every word. |
format | Online Article Text |
id | pubmed-7148237 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-7148237 2020-04-13 Multi-components System for Automatic Arabic Diacritization Abbad, Hamza Xiong, Shengwu Advances in Information Retrieval Article In this paper, we propose an approach to tackle the problem of the automatic restoration of Arabic diacritics that includes three components stacked in a pipeline: a deep learning model which is a multi-layer recurrent neural network with LSTM and Dense layers, a character-level rule-based corrector which applies deterministic operations to prevent some errors, and a word-level statistical corrector which uses the context and the distance information to fix some diacritization issues. This approach is novel in a way that combines methods of different types and adds edit distance based corrections. We used a large public dataset containing raw diacritized Arabic text (Tashkeela) for training and testing our system after cleaning and normalizing it. On a newly-released benchmark test set, our system outperformed all the tested systems by achieving DER of 3.39% and WER of 9.94% when taking all Arabic letters into account, DER of 2.61% and WER of 5.83% when ignoring the diacritization of the last letter of every word. 2020-03-17 /pmc/articles/PMC7148237/ http://dx.doi.org/10.1007/978-3-030-45439-5_23 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Abbad, Hamza Xiong, Shengwu Multi-components System for Automatic Arabic Diacritization |
title | Multi-components System for Automatic Arabic Diacritization |
title_sort | multi-components system for automatic arabic diacritization |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148237/ http://dx.doi.org/10.1007/978-3-030-45439-5_23 |
work_keys_str_mv | AT abbadhamza multicomponentssystemforautomaticarabicdiacritization AT xiongshengwu multicomponentssystemforautomaticarabicdiacritization |
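
The description above reports results as DER and WER, with and without the last letter of each word. As a rough illustration of what such figures measure, here is a minimal Python sketch. It assumes the commonly used definitions (DER as the share of characters whose diacritics differ from the reference, WER as the share of words containing at least one such character); the record does not give the paper's exact counting rules, so the function names `split_word` and `error_rates`, the set of marks considered, and the decision to count every non-diacritic character are all assumptions made for illustration only.

```python
# Arabic diacritic marks U+064B..U+0652 (fathatan through sukun).
# Assumption: only these eight marks are treated as diacritics.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_word(word):
    """Return a list of (character, attached diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # a mark with no preceding character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, prediction, ignore_last=False):
    """Compute (DER, WER) between two diacritized versions of the same text.

    Hypothetical helper: both texts must contain the same words and the
    same base characters, differing only in their diacritics.
    """
    ref_words = reference.split()
    pred_words = prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("texts must contain the same number of words")

    char_total = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs = split_word(ref_word)
        pred_pairs = split_word(pred_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("base characters differ: " + ref_word)
        if ignore_last:
            # Skip the last character of every word (case-ending position).
            ref_pairs, pred_pairs = ref_pairs[:-1], pred_pairs[:-1]
        # Sort marks so that the order of shadda and vowel does not matter.
        errors = sum(
            sorted(ref_marks) != sorted(pred_marks)
            for (_, ref_marks), (_, pred_marks) in zip(ref_pairs, pred_pairs)
        )
        char_total += len(ref_pairs)
        char_errors += errors
        word_errors += errors > 0

    der = char_errors / char_total if char_total else 0.0
    wer = word_errors / len(ref_words) if ref_words else 0.0
    return der, wer
```

Calling `error_rates(reference, prediction)` returns both rates as fractions; passing `ignore_last=True` corresponds to the second pair of figures in the description, where the diacritic on the final letter of each word is left out of the count.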