
Is the Corpus Ready for Machine Translation? A Case Study with Python to Pseudo-Code Corpus

Bibliographic Details
Main authors:	Rai, Sawan; Belwal, Ramesh Chandra; Gupta, Atul
Format:	Online Article Text
Language:	English
Published:	Springer Berlin Heidelberg 2022
Subjects:	Research Article-Computer Engineering and Computer Science
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9296120/ https://www.ncbi.nlm.nih.gov/pubmed/35874184 http://dx.doi.org/10.1007/s13369-022-07049-0

collection	PubMed
description	The availability of data is the driving force behind most state-of-the-art techniques for machine translation tasks. Understandably, this availability motivates researchers to propose new techniques and to claim their superiority over existing ones using suitable evaluation measures. However, the performance of the underlying learning algorithms can be greatly influenced by the correctness and consistency of the corpus. We present our investigation into the relevance of a publicly available Python to pseudo-code parallel corpus for the automated documentation task, and into the studies performed using this corpus. We found that the corpus had many visible issues, such as overlapping instances, inconsistent translation styles, incompleteness, and misspelled words. We show that these discrepancies can significantly influence the performance of the learning algorithms, to the extent that they could have caused previous studies to draw incorrect conclusions. We performed our experimental study using statistical machine translation and neural machine translation models. We recorded a significant difference ([Formula: see text] 10% on BLEU score) in the models’ performance after removing the issues from the corpus.
format	Online Article Text
id	pubmed-9296120
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer Berlin Heidelberg
record_format	MEDLINE/PubMed
spelling	pubmed-9296120 2022-07-20 Arab J Sci Eng Springer Berlin Heidelberg 2022-07-19 2023 Text en © King Fahd University of Petroleum & Minerals 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.


Similar items