
Improving Neural Machine Translation by Filtering Synthetic Parallel Data

Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly pai...


Bibliographic Details
Main authors: Xu, Guanghao; Ko, Youngjoong; Seo, Jungyun
Format: Online Article Text
Language: English
Published: MDPI 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514558/
http://dx.doi.org/10.3390/e21121213
_version_ 1783586615621320704
author Xu, Guanghao
Ko, Youngjoong
Seo, Jungyun
author_sort Xu, Guanghao
collection PubMed
description Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 Bleu points for tst2016 and tst2017, respectively.
format Online
Article
Text
id pubmed-7514558
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-75145582020-11-09 Improving Neural Machine Translation by Filtering Synthetic Parallel Data Xu, Guanghao Ko, Youngjoong Seo, Jungyun Entropy (Basel) Article Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise—weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 Bleu points for tst2016 and tst2017, respectively. MDPI 2019-12-11 /pmc/articles/PMC7514558/ http://dx.doi.org/10.3390/e21121213 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Improving Neural Machine Translation by Filtering Synthetic Parallel Data
title_sort improving neural machine translation by filtering synthetic parallel data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514558/
http://dx.doi.org/10.3390/e21121213
work_keys_str_mv AT xuguanghao improvingneuralmachinetranslationbyfilteringsyntheticparalleldata
AT koyoungjoong improvingneuralmachinetranslationbyfilteringsyntheticparalleldata
AT seojungyun improvingneuralmachinetranslationbyfilteringsyntheticparalleldata
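
The description field above says that each synthetic sentence pair receives a semantic similarity score computed from bilingual word embeddings, and that pairs are then selected by score. The record does not say how the score is computed, so the following is only a minimal sketch under stated assumptions: sentence vectors are averaged word vectors, the score is cosine similarity, and selection is a fixed threshold. The function names, the toy embeddings, and the threshold value are all hypothetical and are not taken from the paper.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the vectors of tokens found in the embedding table (assumption: mean pooling)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known token, so no sentence vector
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(src_tokens, tgt_tokens, src_emb, tgt_emb):
    """Score a sentence pair in a shared bilingual embedding space."""
    u = sentence_vector(src_tokens, src_emb)
    v = sentence_vector(tgt_tokens, tgt_emb)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


def filter_pairs(pairs, src_emb, tgt_emb, threshold):
    """Keep only pairs whose score reaches the (hypothetical) threshold."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if similarity(src.split(), tgt.split(), src_emb, tgt_emb) >= threshold
    ]


# Toy two-dimensional bilingual embeddings, invented for illustration only.
src_emb = {"고양이": [1.0, 0.0], "개": [0.0, 1.0]}
tgt_emb = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}

# Second pair imitates a bad back-translation: the source and target disagree.
synthetic_pairs = [("고양이", "cat"), ("고양이", "dog")]
kept = filter_pairs(synthetic_pairs, src_emb, tgt_emb, threshold=0.5)
```

With these toy vectors the mismatched pair scores well below the threshold and is dropped, while the matching pair is kept. In practice the threshold, or the fraction of pairs retained, would have to be tuned on held-out data.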