
A large quantitative analysis of written language challenges the idea that all languages are equally complex

One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large scale quantitative cross-linguistic analysis of written language by training a language model on more than 6500 different documents as represented in 41 multilingual text collections consisting of ~ 3.5 billion words or ~ 9.0 billion characters and covering 2069 different languages that are spoken as a native language by more than 90% of the world population. We statistically infer the entropy of each language model as an index of what we call average prediction complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. In addition, we show that speaker population size predicts entropy. We argue that both results constitute evidence against the equi-complexity hypothesis from an information-theoretic perspective.
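As a rough illustration of what the abstract means by inferring "entropy ... as an index of average prediction complexity" — not the authors' actual models or corpora, which are far larger — the Python sketch below estimates bits per character of a text under a toy character-level bigram model: the higher the value, the harder the next character is to predict on average. The function name bits_per_character and the sample strings are invented for this example.

import math
import random
from collections import Counter, defaultdict

def bits_per_character(text: str) -> float:
    # Fit a character-level bigram model on the text itself and return the
    # average number of bits needed to encode each next character under it
    # (add-one smoothing over the text's own alphabet).
    alphabet = set(text)
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        pair_counts[prev][nxt] += 1

    total_bits, n = 0.0, 0
    for prev, nxt in zip(text, text[1:]):
        counts = pair_counts[prev]
        p = (counts[nxt] + 1) / (sum(counts.values()) + len(alphabet))
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n

# Toy comparison: repetitive text is more predictable (fewer bits/char)
# than the same characters in random order.
sample = "the quick brown fox jumps over the lazy dog " * 40
shuffled = "".join(random.sample(sample, len(sample)))
print(bits_per_character(sample), bits_per_character(shuffled))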

Bibliographic Details
Main Authors: Koplenig, Alexander, Wolfer, Sascha, Meyer, Peter
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505229/
https://www.ncbi.nlm.nih.gov/pubmed/37717109
http://dx.doi.org/10.1038/s41598-023-42327-3
author Koplenig, Alexander
Wolfer, Sascha
Meyer, Peter
collection PubMed
description One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large scale quantitative cross-linguistic analysis of written language by training a language model on more than 6500 different documents as represented in 41 multilingual text collections consisting of ~ 3.5 billion words or ~ 9.0 billion characters and covering 2069 different languages that are spoken as a native language by more than 90% of the world population. We statistically infer the entropy of each language model as an index of what we call average prediction complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. In addition, we show that speaker population size predicts entropy. We argue that both results constitute evidence against the equi-complexity hypothesis from an information-theoretic perspective.
format Online
Article
Text
id pubmed-10505229
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10505229 2023-09-18
Koplenig, Alexander; Wolfer, Sascha; Meyer, Peter. A large quantitative analysis of written language challenges the idea that all languages are equally complex. Sci Rep (Article). Nature Publishing Group UK, 2023-09-16. /pmc/articles/PMC10505229/ /pubmed/37717109 http://dx.doi.org/10.1038/s41598-023-42327-3
Text en © The Author(s) 2023. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title A large quantitative analysis of written language challenges the idea that all languages are equally complex
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505229/
https://www.ncbi.nlm.nih.gov/pubmed/37717109
http://dx.doi.org/10.1038/s41598-023-42327-3