A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

Data selection has shown significant improvements in effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techn...

Bibliographic Details
Main authors: Wang, Longyue, Wong, Derek F., Chao, Lidia S., Lu, Yi, Xing, Junwen
Format: Online Article Text
Language: English
Published: Hindawi Publishing Corporation 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3934767/
https://www.ncbi.nlm.nih.gov/pubmed/24683356
http://dx.doi.org/10.1155/2014/745485
author Wang, Longyue
Wong, Derek F.
Chao, Lidia S.
Lu, Yi
Xing, Junwen
collection PubMed
description Data selection has shown significant improvements in the effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first is cosine tf-idf, which comes from the realm of information retrieval (IR). The second is a perplexity-based approach, which comes from the field of language modeling. These two data selection techniques, as applied to SMT, have already been presented in the literature. Edit distance, however, is proposed for this task for the first time in this paper. After investigating the individual models, a combination of all three techniques is proposed at both the corpus level and the model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus, and the results indicate the following: (i) the constraint degree of similarity measuring is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; but (iii) bilingual resources and combination methods are helpful to balance out-of-vocabulary (OOV) and irrelevant data; (iv) finally, our method achieves the goal of consistently boosting the overall translation performance, which can ensure optimal quality of a real-life SMT system.
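Two of the three criteria named in the description, cosine tf-idf and edit distance, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the word-level (rather than character-level) edit distance, and the smoothed idf formula are all assumptions made for illustration. The perplexity-based criterion is left out because it needs a trained language model.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumption) so that every term keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical in-domain query and general-domain candidates.
query = "the court shall hear the appeal".split()
candidates = [
    "the court shall hear the case".split(),
    "stock prices fell sharply today".split(),
]
vectors = tfidf_vectors([query] + candidates)
for cand, vec in zip(candidates, vectors[1:]):
    print(" ".join(cand), round(cosine(vectors[0], vec), 3),
          edit_distance(query, cand))
```

In a selection setting, general-domain sentences would be ranked by such scores against in-domain text (higher cosine, lower edit distance) and the top-ranked portion retained; how the scores are thresholded or combined is not specified in this record.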
format Online
Article
Text
id pubmed-3934767
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Hindawi Publishing Corporation
record_format MEDLINE/PubMed
spelling pubmed-3934767 2014-03-30 A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation Wang, Longyue Wong, Derek F. Chao, Lidia S. Lu, Yi Xing, Junwen ScientificWorldJournal Research Article Hindawi Publishing Corporation 2014-02-11 /pmc/articles/PMC3934767/ /pubmed/24683356 http://dx.doi.org/10.1155/2014/745485 Text en Copyright © 2014 Longyue Wang et al.
https://creativecommons.org/licenses/by/3.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation
topic Research Article