
The neural machine translation models for the low-resource Kazakh–English language pair

The development of the machine translation field was driven by people’s need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. The neural machine translation approach has become one of the most significant in recent years...

Bibliographic Details
Main Authors:	Karyukin, Vladislav; Rakhimova, Diana; Karibayeva, Aidana; Turganbayeva, Aliya; Turarbek, Asem
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Artificial Intelligence
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280589/ https://www.ncbi.nlm.nih.gov/pubmed/37346576 http://dx.doi.org/10.7717/peerj-cs.1224

_version_	1785060829249404928
author	Karyukin, Vladislav Rakhimova, Diana Karibayeva, Aidana Turganbayeva, Aliya Turarbek, Asem
author_sort	Karyukin, Vladislav
collection	PubMed
description	The development of the machine translation field was driven by people’s need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. The neural machine translation approach has become one of the most significant in recent years. This approach requires large parallel corpora that are not available for low-resource languages such as Kazakh, which makes it difficult for neural machine translation models to achieve high performance. This article explores existing methods for dealing with low-resource languages by artificially increasing the size of the corpora and improving the performance of Kazakh–English machine translation models. These methods are forward translation, backward translation, and transfer learning. The Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures, with their features and specifications, are then considered for experiments in training models on parallel corpora. The experimental part focuses on building models for the high-quality translation of formal social, political, and scientific texts, using synthetic parallel sentences generated from existing monolingual Kazakh data with the forward translation approach and combining them with parallel corpora parsed from official government websites. The recurrent neural network (RNN), bidirectional recurrent neural network (BRNN), and Transformer models of the OpenNMT framework are trained on a total corpus of 380,000 parallel Kazakh–English sentences. The quality of the trained models is evaluated with the BLEU, WER, and TER metrics, and sample translations are also analyzed. The RNN and BRNN models produced more precise translations than the Transformer model. The Byte-Pair Encoding tokenization technique yielded better metric scores and translations than the word tokenization technique. The BRNN with Byte-Pair Encoding showed the best performance, with 0.49 BLEU, 0.51 WER, and 0.45 TER.
format	Online Article Text
id	pubmed-10280589
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-10280589 2023-06-21 The neural machine translation models for the low-resource Kazakh–English language pair Karyukin, Vladislav Rakhimova, Diana Karibayeva, Aidana Turganbayeva, Aliya Turarbek, Asem PeerJ Comput Sci Artificial Intelligence PeerJ Inc. 2023-02-08 /pmc/articles/PMC10280589/ /pubmed/37346576 http://dx.doi.org/10.7717/peerj-cs.1224 Text en © 2023 Karyukin et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title	The neural machine translation models for the low-resource Kazakh–English language pair
topic	Artificial Intelligence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10280589/ https://www.ncbi.nlm.nih.gov/pubmed/37346576 http://dx.doi.org/10.7717/peerj-cs.1224
