
Domain adaptation of statistical machine translation with domain-focused web crawling

In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and paral...


Bibliographic Details
Main authors:	Pecina, Pavel; Toral, Antonio; Papavassiliou, Vassilis; Prokopidis, Prokopis; Tamchyna, Aleš; Way, Andy; van Genabith, Josef
Format:	Online Article Text
Language:	English
Published:	Springer Netherlands 2014
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479164/ https://www.ncbi.nlm.nih.gov/pubmed/26120290 http://dx.doi.org/10.1007/s10579-014-9282-3

Journal:	Lang Resour Eval
Collection:	PubMed
Institution:	National Center for Biotechnology Information
Record ID:	pubmed-4479164
Record format:	MEDLINE/PubMed
Record date:	2015-06-26
Publication dates:	2014-12-03; 2015
License:	© The Author(s) 2015. Open Access: This article is distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Abstract:	In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.
