
Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation


Bibliographic Details
Main Authors: Adjeisah, Michael; Liu, Guohua; Nyabuga, Douglas Omwenga; Nortey, Richard Nuetey; Song, Jinling
Format: Online Article Text
Language: English
Published: Hindawi, 2021
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055430/
https://www.ncbi.nlm.nih.gov/pubmed/33936190
http://dx.doi.org/10.1155/2021/6682385
Collection: PubMed
Abstract: Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains challenging. This research contributes to the domain with a low-resource English-Twi translation system built on filtered synthetic parallel corpora. It is often difficult to judge what a good-quality corpus looks like under low-resource conditions, especially when the target corpus is the only sample text available for the language pair. To improve MT performance for such low-resource language pairs, we propose expanding the training data by injecting a synthetic parallel corpus obtained by translating a monolingual corpus from the target language, bootstrapped with different parameter settings. Furthermore, we perform unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we apply three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and filtering it extensively with sentence-level similarity metrics significantly improve out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same framework under the same settings, our approach achieves substantial gains in BLEU and TER scores.
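The squared-Mahalanobis filtering step described in the abstract can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the feature vector (source length, target length, length ratio) and the chi-squared-based cutoff are assumptions chosen for the example.

```python
import numpy as np

def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)   # pseudo-inverse guards against a singular covariance
    diff = X - mu
    # diag(diff @ cov_inv @ diff.T), computed row-wise without the full matrix
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

def filter_pairs(pairs, threshold=11.34):
    """Keep sentence pairs whose feature vector lies close to the corpus mean.

    Features (illustrative assumption): source length, target length,
    source/target length ratio. The default threshold is roughly the
    chi-squared 99th percentile with 3 degrees of freedom; the real cutoff
    would be tuned on the corpus at hand.
    """
    feats = np.array(
        [[len(src.split()),
          len(tgt.split()),
          len(src.split()) / max(len(tgt.split()), 1)]
         for src, tgt in pairs],
        dtype=float,
    )
    d2 = squared_mahalanobis(feats)
    return [pair for pair, d in zip(pairs, d2) if d <= threshold]
```

An extreme length mismatch (a likely non-parallel pair) sits far from the corpus mean in this feature space, so its squared distance exceeds the cutoff and the pair is dropped, while well-matched pairs survive. The pseudo-inverse is used so that near-duplicate features do not crash the computation.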
ID: pubmed-8055430
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Comput Intell Neurosci (Research Article)
Published online: 2021-04-11
Copyright © 2021 Michael Adjeisah et al. This is an open access article distributed under the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.