
Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques

The present research reports on the use of data mining techniques for differentiating between translated and non-translated original Chinese based on monolingual comparable corpora. We operationalized seven entropy-based metrics including character, wordform unigram, wordform bigram and wordform tri...


Bibliographic Details
Main Authors: Liu, Kanglong; Ye, Rongguang; Zhongzhu, Liu; Ye, Rongye
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8947138/
https://www.ncbi.nlm.nih.gov/pubmed/35324927
http://dx.doi.org/10.1371/journal.pone.0265633
author Liu, Kanglong
Ye, Rongguang
Zhongzhu, Liu
Ye, Rongye
collection PubMed
description The present research reports on the use of data mining techniques to differentiate translated Chinese from non-translated (original) Chinese based on monolingual comparable corpora. We operationalized seven entropy-based metrics (character, wordform unigram, wordform bigram, wordform trigram, POS (part-of-speech) unigram, POS bigram and POS trigram entropy) from two balanced Chinese comparable corpora (translated vs non-translated) for data mining and analysis. We then applied four data mining techniques, namely Support Vector Machines (SVMs), Linear Discriminant Analysis (LDA), Random Forest (RF) and Multilayer Perceptron (MLP), to distinguish translated Chinese from original Chinese based on these seven features. Our results show that the SVM is the most robust and effective classifier, yielding an AUC of 90.5% and an accuracy rate of 84.3%. Our results affirm the hypothesis that translational language is categorically different from original language. Our research demonstrates that combining the information-theoretic indicator of Shannon’s entropy with machine learning techniques can provide a novel approach for studying translation as a unique communicative activity. This study yields new insights for corpus-based studies of the translationese phenomenon in the field of translation studies.
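The seven entropy features named in the description can be illustrated with a minimal Python sketch. It assumes a plug-in (relative-frequency) estimate of Shannon entropy in bits over n-gram counts; the record does not state how the authors estimated entropy, and the function names, feature keys and toy sentence below are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shannon_entropy(items):
    """Plug-in Shannon entropy (in bits) of the empirical distribution of items.

    Assumption: relative frequencies are used as probabilities; the paper may
    use a different estimator or log base.
    """
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_features(characters, wordforms, pos_tags):
    """Compute the seven features listed in the description for one text.

    The three inputs are assumed to be pre-segmented and pre-tagged sequences;
    segmentation and POS tagging are outside the scope of this sketch.
    """
    features = {"char": shannon_entropy(characters)}
    for n, name in ((1, "unigram"), (2, "bigram"), (3, "trigram")):
        features[f"word_{name}"] = shannon_entropy(ngrams(wordforms, n))
        features[f"pos_{name}"] = shannon_entropy(ngrams(pos_tags, n))
    return features


# Hypothetical toy input: one segmented, POS-tagged sentence.
wordforms = ["我", "喜欢", "读", "书"]
pos_tags = ["r", "v", "v", "n"]
characters = list("".join(wordforms))
toy_features = entropy_features(characters, wordforms, pos_tags)
```

Each text would yield one seven-dimensional feature vector of this kind, which could then be passed to any of the four classifiers mentioned in the description.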
format Online
Article
Text
id pubmed-8947138
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8947138 2022-03-25 Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques Liu, Kanglong Ye, Rongguang Zhongzhu, Liu Ye, Rongye PLoS One Research Article
Public Library of Science 2022-03-24 /pmc/articles/PMC8947138/ /pubmed/35324927 http://dx.doi.org/10.1371/journal.pone.0265633 Text en © 2022 Liu et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8947138/
https://www.ncbi.nlm.nih.gov/pubmed/35324927
http://dx.doi.org/10.1371/journal.pone.0265633