Automated content analysis across six languages

Corpus selection bias in international relations research presents an epistemological problem: How do we know what we know? Most social science research in the field of text analytics relies on English language corpora, biasing our ability to understand international phenomena. To address the issue...

Bibliographic Details
Main authors: Windsor, Leah Cathryn; Cupit, James Grayson; Windsor, Alistair James
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6867602/
https://www.ncbi.nlm.nih.gov/pubmed/31747404
http://dx.doi.org/10.1371/journal.pone.0224425
author Windsor, Leah Cathryn
Cupit, James Grayson
Windsor, Alistair James
collection PubMed
description Corpus selection bias in international relations research presents an epistemological problem: How do we know what we know? Most social science research in the field of text analytics relies on English language corpora, biasing our ability to understand international phenomena. To address the issue of corpus selection bias, we introduce results that suggest that machine translation may be used to address non-English sources. We use human translation and machine translation (Google Translate) on a collection of aligned sentences from United Nations documents extracted from the Multi-UN corpus, analyzed with a “bag of words” analysis tool, Linguistic Inquiry Word Count (LIWC). Overall, the LIWC indices proved relatively stable across machine and human translated sentences. We find that while there are statistically significant differences between the original and translated documents, the effect sizes are relatively small, especially when looking at psychological processes.
format Online
Article
Text
id pubmed-6867602
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6867602 2019-12-07 Automated content analysis across six languages. Windsor, Leah Cathryn; Cupit, James Grayson; Windsor, Alistair James. PLoS One. Research Article. Public Library of Science 2019-11-20. /pmc/articles/PMC6867602/ /pubmed/31747404 http://dx.doi.org/10.1371/journal.pone.0224425 Text en. © 2019 Windsor et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Automated content analysis across six languages
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6867602/
https://www.ncbi.nlm.nih.gov/pubmed/31747404
http://dx.doi.org/10.1371/journal.pone.0224425