
Machine Translation Utilizing the Frequent-Item Set Concept

In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules joining phrases present in the source language, and phrases present in the target language. It has to be noted that...


Bibliographic Details
Main Authors: Mahmoud, Hanan A. Hosni, Mengash, Hanan Abdullah
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7926351/
https://www.ncbi.nlm.nih.gov/pubmed/33670035
http://dx.doi.org/10.3390/s21041493
author Mahmoud, Hanan A. Hosni
Mengash, Hanan Abdullah
collection PubMed
description In this paper, we introduce new concepts in the machine translation paradigm. We treat the corpus as a database of frequent word sets. A translation request triggers association rules that join phrases in the source language with phrases in the target language. It has to be noted that a sequential scan of the corpus for such phrases will increase the response time in an unexpected manner. We therefore pre-process the bilingual corpus into a data structure called the Corpus-Trie (CT), which renders a bilingual parallel corpus as a compact data structure representing frequent item sets. We also present algorithms that use the CT to respond to translation requests and explore novel techniques in exhaustive experiments. Experiments were performed on specific language pairs, although the proposed method is not restricted to any specific language. Moreover, the proposed Corpus-Trie can be extended from bilingual corpora to accommodate multi-language corpora. Experiments indicated that the response time of a translation request is logarithmic in the number of distinct phrases in the original bilingual corpus (and thus in the Corpus-Trie size). In practical situations, 5–20% of the log of the number of nodes have to be visited. The experimental results indicate that the BLEU score of the proposed CT system increases with the number of phrases in the CT, for both English-Arabic and English-French translation. The proposed CT system was shown to outperform both Omega-T and Apertium in translation quality once the corpus size exceeds 1,600,000 phrases for English-Arabic translation and 300,000 phrases for English-French translation.
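The abstract does not specify the layout of the Corpus-Trie, so the following is only a minimal sketch of the general idea of indexing a parallel corpus in a trie rather than scanning it sequentially. It assumes a word-level trie keyed on source-language tokens in which each node counts the target-language phrases aligned with the source phrase ending there. The names `TrieNode`, `PhraseTrie`, `insert`, and `lookup`, and the toy English-French pairs, are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Child nodes keyed by the next source-language word.
    children: dict = field(default_factory=dict)
    # Target-language phrases aligned with the source phrase ending here,
    # with how often each pairing occurred in the corpus.
    translations: Counter = field(default_factory=Counter)


class PhraseTrie:
    """Hypothetical word-level phrase trie (an assumption, not the paper's CT)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, source_phrase: str, target_phrase: str) -> None:
        """Record one aligned (source, target) phrase pair."""
        node = self.root
        for word in source_phrase.lower().split():
            node = node.children.setdefault(word, TrieNode())
        node.translations[target_phrase] += 1

    def lookup(self, source_phrase: str):
        """Return the most frequent target phrase, or None if unseen."""
        node = self.root
        for word in source_phrase.lower().split():
            node = node.children.get(word)
            if node is None:
                return None
        if not node.translations:
            return None
        return node.translations.most_common(1)[0][0]


# Toy parallel data, invented for illustration only.
trie = PhraseTrie()
for src, tgt in [
    ("good morning", "bonjour"),
    ("good morning", "bonjour"),
    ("good morning", "bon matin"),
    ("good evening", "bonsoir"),
]:
    trie.insert(src, tgt)

print(trie.lookup("good morning"))  # bonjour
print(trie.lookup("good night"))    # None
```

In this sketch a lookup walks one node per word of the requested phrase, so its cost depends on the phrase length and not on the number of phrase pairs stored. Repeated pairs only increment a counter, which is one way to read "frequent item sets" here.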
format Online
Article
Text
id pubmed-7926351
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7926351 2021-03-04 Machine Translation Utilizing the Frequent-Item Set Concept Mahmoud, Hanan A. Hosni Mengash, Hanan Abdullah Sensors (Basel) Article
MDPI 2021-02-21 /pmc/articles/PMC7926351/ /pubmed/33670035 http://dx.doi.org/10.3390/s21041493 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Machine Translation Utilizing the Frequent-Item Set Concept
topic Article