
Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size

Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the co...


Bibliographic Details
Main Authors:	Koplenig, Alexander; Wolfer, Sascha; Müller-Spitzer, Carolin
Format:	Online Article Text
Language:	English
Published:	MDPI 2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514953/ https://www.ncbi.nlm.nih.gov/pubmed/33267178 http://dx.doi.org/10.3390/e21050464

author	Koplenig, Alexander Wolfer, Sascha Müller-Spitzer, Carolin
collection	PubMed
description	Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf’s law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
format	Online Article Text
id	pubmed-7514953
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7514953 2020-11-09 Entropy (Basel) Article MDPI 2019-05-03 /pmc/articles/PMC7514953/ /pubmed/33267178 http://dx.doi.org/10.3390/e21050464 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7514953/ https://www.ncbi.nlm.nih.gov/pubmed/33267178 http://dx.doi.org/10.3390/e21050464
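
The abstract's central claim, that entropies of order α computed from word frequencies depend on text length, can be illustrated with a minimal Python sketch. The record does not state the entropy formula, so the form used here, H_α = (Σ p_i^α − 1) / (1 − α) with the Shannon entropy as the α = 1 case, is an assumption, and the function names `generalized_entropy` and `entropy_by_prefix` are hypothetical. This is not the authors' implementation.

```python
import math
from collections import Counter


def generalized_entropy(tokens, alpha):
    """Entropy of order alpha for the word frequency distribution of `tokens`.

    Assumed form (not given in the catalog record):
        H_alpha = (sum_i p_i**alpha - 1) / (1 - alpha)
    with the Shannon entropy (natural log) used for alpha == 1.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs)
    return (sum(p ** alpha for p in probs) - 1) / (1 - alpha)


def entropy_by_prefix(tokens, sizes, alpha):
    """Compute the entropy on the first n tokens for each n in `sizes`.

    Comparing the values across prefix lengths shows how the estimate
    moves with sample size.
    """
    return [generalized_entropy(tokens[:n], alpha) for n in sizes]
```

Under this assumed form, α = 0 reduces the entropy to the number of distinct word types minus one, a quantity that keeps growing as more text is read; larger values of α give more weight to high-frequency words. Calling `entropy_by_prefix(tokens, [1000, 10000, 100000], alpha)` on a tokenized text is one way to inspect the dependence on text length for a chosen α.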


Similar Items