Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation
Main Authors: Adjeisah, Michael; Liu, Guohua; Nyabuga, Douglas Omwenga; Nortey, Richard Nuetey; Song, Jinling
Format: Online Article Text
Language: English
Published: Hindawi, 2021
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055430/ https://www.ncbi.nlm.nih.gov/pubmed/33936190 http://dx.doi.org/10.1155/2021/6682385
_version_ | 1783680447951142912 |
author | Adjeisah, Michael Liu, Guohua Nyabuga, Douglas Omwenga Nortey, Richard Nuetey Song, Jinling |
author_facet | Adjeisah, Michael Liu, Guohua Nyabuga, Douglas Omwenga Nortey, Richard Nuetey Song, Jinling |
author_sort | Adjeisah, Michael |
collection | PubMed |
description | Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains challenging. This research contributes to the domain with a low-resource English-Twi translation system based on filtered synthetic-parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, especially when the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose expanding the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we perform unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpus demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach yields substantial gains in BLEU and TER scores. |
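The abstract's filtering step — scoring each sentence pair by squared Mahalanobis distance and discarding outliers as likely non-parallel — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature representation (here, a generic per-pair feature vector such as the difference of source and target sentence embeddings), the `keep_fraction` threshold, and the covariance regularization constant are all assumptions for the sketch.

```python
import numpy as np

def mahalanobis_filter(pair_features, keep_fraction=0.8):
    """Rank sentence pairs by squared Mahalanobis distance; keep the closest.

    pair_features: (n_pairs, d) array of per-pair features (hypothetical
    choice: difference between source- and target-sentence embeddings).
    Pairs whose feature vector lies far from the corpus mean, relative to
    the corpus covariance, are treated as likely non-parallel and dropped.
    Returns (indices of kept pairs, squared distances for all pairs).
    """
    X = np.asarray(pair_features, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Small ridge term so the covariance stays invertible for small samples.
    cov += 1e-6 * np.eye(cov.shape[0])
    inv_cov = np.linalg.inv(cov)
    diff = X - mu
    # Squared Mahalanobis distance per pair: (x - mu)^T S^{-1} (x - mu)
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    n_keep = max(1, int(keep_fraction * len(d2)))
    keep_idx = np.argsort(d2)[:n_keep]  # smallest distance = most "parallel"
    return np.sort(keep_idx), d2
```

A pair whose features sit far from the bulk of the corpus (for example, a mistranslated pseudo-sentence) receives a large distance and falls outside the kept fraction, which is the behavior the filtering step relies on.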
format | Online Article Text |
id | pubmed-8055430 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Hindawi |
record_format | MEDLINE/PubMed |
spelling | pubmed-80554302021-04-29 Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation Adjeisah, Michael Liu, Guohua Nyabuga, Douglas Omwenga Nortey, Richard Nuetey Song, Jinling Comput Intell Neurosci Research Article Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain on a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair engaging squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpus demonstrate that injecting pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits tremendous developments in BLEU and TER scores. Hindawi 2021-04-11 /pmc/articles/PMC8055430/ /pubmed/33936190 http://dx.doi.org/10.1155/2021/6682385 Text en Copyright © 2021 Michael Adjeisah et al. 
https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Adjeisah, Michael Liu, Guohua Nyabuga, Douglas Omwenga Nortey, Richard Nuetey Song, Jinling Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation |
title | Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation |
title_full | Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation |
title_fullStr | Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation |
title_full_unstemmed | Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation |
title_short | Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation |
title_sort | pseudotext injection and advance filtering of low-resource corpus for neural machine translation |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055430/ https://www.ncbi.nlm.nih.gov/pubmed/33936190 http://dx.doi.org/10.1155/2021/6682385 |
work_keys_str_mv | AT adjeisahmichael pseudotextinjectionandadvancefilteringoflowresourcecorpusforneuralmachinetranslation AT liuguohua pseudotextinjectionandadvancefilteringoflowresourcecorpusforneuralmachinetranslation AT nyabugadouglasomwenga pseudotextinjectionandadvancefilteringoflowresourcecorpusforneuralmachinetranslation AT norteyrichardnuetey pseudotextinjectionandadvancefilteringoflowresourcecorpusforneuralmachinetranslation AT songjinling pseudotextinjectionandadvancefilteringoflowresourcecorpusforneuralmachinetranslation |