Improving Neural Machine Translation by Filtering Synthetic Parallel Data
Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly pai...
Main Authors: | Xu, Guanghao; Ko, Youngjoong; Seo, Jungyun |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514558/ http://dx.doi.org/10.3390/e21121213 |
_version_ | 1783586615621320704 |
---|---|
author | Xu, Guanghao Ko, Youngjoong Seo, Jungyun |
author_facet | Xu, Guanghao Ko, Youngjoong Seo, Jungyun |
author_sort | Xu, Guanghao |
collection | PubMed |
description | Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise (weakly paired sentences or translation errors). In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points for tst2016 and tst2017, respectively. |
format | Online Article Text |
id | pubmed-7514558 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-75145582020-11-09 Improving Neural Machine Translation by Filtering Synthetic Parallel Data Xu, Guanghao Ko, Youngjoong Seo, Jungyun Entropy (Basel) Article Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 Bleu points for tst2016 and tst2017, respectively. MDPI 2019-12-11 /pmc/articles/PMC7514558/ http://dx.doi.org/10.3390/e21121213 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Xu, Guanghao Ko, Youngjoong Seo, Jungyun Improving Neural Machine Translation by Filtering Synthetic Parallel Data |
title | Improving Neural Machine Translation by Filtering Synthetic Parallel Data |
title_full | Improving Neural Machine Translation by Filtering Synthetic Parallel Data |
title_fullStr | Improving Neural Machine Translation by Filtering Synthetic Parallel Data |
title_full_unstemmed | Improving Neural Machine Translation by Filtering Synthetic Parallel Data |
title_short | Improving Neural Machine Translation by Filtering Synthetic Parallel Data |
title_sort | improving neural machine translation by filtering synthetic parallel data |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514558/ http://dx.doi.org/10.3390/e21121213 |
work_keys_str_mv | AT xuguanghao improvingneuralmachinetranslationbyfilteringsyntheticparalleldata AT koyoungjoong improvingneuralmachinetranslationbyfilteringsyntheticparalleldata AT seojungyun improvingneuralmachinetranslationbyfilteringsyntheticparalleldata |
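
Illustrative note on the method summarized in the description field: the abstract says each synthetic sentence pair receives a semantic similarity score computed from bilingual word embeddings, and that pairs are then selected by score. This record does not give the exact scoring formula, so the following is only a minimal sketch under stated assumptions: sentence vectors are taken as the average of word vectors in a shared bilingual space, similarity is cosine, and selection uses a fixed threshold. The function names, the toy embeddings, and the threshold value are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def sentence_vector(tokens: List[str], embeddings: Dict[str, Vector]) -> Optional[List[float]]:
    """Average the vectors of in-vocabulary tokens (assumption: mean pooling)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known token, so no sentence vector
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pair(src: List[str], tgt: List[str],
               src_emb: Dict[str, Vector], tgt_emb: Dict[str, Vector]) -> float:
    """Similarity of a sentence pair in the shared embedding space."""
    s = sentence_vector(src, src_emb)
    t = sentence_vector(tgt, tgt_emb)
    if s is None or t is None:
        return 0.0  # treat unscorable pairs as noise
    return cosine(s, t)


def filter_pairs(pairs: List[Pair], src_emb: Dict[str, Vector],
                 tgt_emb: Dict[str, Vector], threshold: float) -> List[Pair]:
    """Keep only pairs whose score reaches the (hypothetical) threshold."""
    return [(s, t) for s, t in pairs if score_pair(s, t, src_emb, tgt_emb) >= threshold]


# Toy 2-dimensional bilingual embeddings, invented for illustration only.
KO_EMB = {"고양이": [1.0, 0.0], "개": [0.0, 1.0]}
EN_EMB = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}

pairs = [(["고양이"], ["cat"]), (["고양이"], ["dog"])]
kept = filter_pairs(pairs, KO_EMB, EN_EMB, threshold=0.5)
print(kept)
```

In this toy setup the well-aligned pair scores close to 1 and is kept, while the mismatched pair scores close to 0 and is dropped. A real pipeline would use trained bilingual embeddings and a threshold (or top-k cutoff) tuned on held-out data.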