
Improving Neural Machine Translation by Filtering Synthetic Parallel Data

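Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise: weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points for tst2016 and tst2017, respectively.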
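
The filtering idea in the abstract (score each synthetic sentence pair by semantic similarity computed from bilingual word embeddings, then keep only the better-scoring pairs) can be illustrated with a minimal sketch. This record does not give the paper's scoring formula, so the sketch assumes one common choice: average the word vectors of each sentence in a shared bilingual space and take the cosine similarity. The toy vocabulary, the romanized Korean tokens, the function names, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical toy bilingual embeddings in one shared vector space.
# "goyangi" and "gae" stand in for Korean tokens; a real setup would
# load pretrained cross-lingual word vectors instead.
EMB = {
    "goyangi": np.array([1.0, 0.0, 0.0]),
    "gae": np.array([0.0, 1.0, 0.0]),
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.1, 0.9, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}


def sentence_vector(tokens, emb=EMB):
    """Average the vectors of known tokens; return None if none are known."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return np.mean(vecs, axis=0)


def similarity(src_tokens, tgt_tokens, emb=EMB):
    """Cosine similarity of averaged sentence vectors (an assumed scoring)."""
    s = sentence_vector(src_tokens, emb)
    t = sentence_vector(tgt_tokens, emb)
    if s is None or t is None:
        return 0.0
    denom = np.linalg.norm(s) * np.linalg.norm(t)
    return float(s @ t / denom) if denom > 0 else 0.0


def filter_pairs(pairs, threshold=0.5, emb=EMB):
    """Keep synthetic pairs whose similarity score reaches the threshold."""
    return [(s, t) for s, t in pairs if similarity(s, t, emb) >= threshold]


# Synthetic (back-translated source, original target) pairs, pre-tokenized.
synthetic = [
    (["goyangi"], ["cat"]),  # well paired
    (["gae"], ["dog"]),      # well paired
    (["goyangi"], ["car"]),  # noisy pair
]
kept = filter_pairs(synthetic)
```

With these toy vectors the two well-paired examples pass the threshold and the mismatched one is dropped. A ranking-based selection (keeping the top-scoring fraction of pairs) would be an equally plausible reading of "selecting sentence pairs according to these scores".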


Bibliographic Details
Main Authors:	Xu, Guanghao; Ko, Youngjoong; Seo, Jungyun
Format:	Online Article Text
Language:	English
Published:	MDPI 2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514558/ http://dx.doi.org/10.3390/e21121213

_version_	1783586615621320704
author	Xu, Guanghao; Ko, Youngjoong; Seo, Jungyun
collection	PubMed
description	Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because the synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains a lot of noise: weakly paired sentences or translation errors. In this paper, we propose a novel approach to filter this noise from synthetic data. For each sentence pair of the synthetic data, we compute a semantic similarity score using bilingual word embeddings. By selecting sentence pairs according to these scores, we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points for tst2016 and tst2017, respectively.
format	Online Article Text
id	pubmed-7514558
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7514558; 2020-11-09; Improving Neural Machine Translation by Filtering Synthetic Parallel Data; Xu, Guanghao; Ko, Youngjoong; Seo, Jungyun; Entropy (Basel); Article; MDPI; 2019-12-11; /pmc/articles/PMC7514558/; http://dx.doi.org/10.3390/e21121213; Text; en; © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	Improving Neural Machine Translation by Filtering Synthetic Parallel Data
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514558/ http://dx.doi.org/10.3390/e21121213
