
From Boltzmann to Zipf through Shannon and Jaynes

The word-frequency distribution provides the fundamental building blocks that generate discourse in natural language. It is well known, from empirical evidence, that the word-frequency distribution of almost any text is described by Zipf’s law, at least approximately. Following Stephens and Bialek (2010), we interpret the frequency of any word as arising from the interaction potentials between its constituent letters. Indeed, Jaynes’ maximum-entropy principle, with the constraints given by every empirical two-letter marginal distribution, leads to a Boltzmann distribution for word probabilities, with an energy-like function given by the sum of the all-to-all pairwise (two-letter) potentials. The so-called improved iterative-scaling algorithm allows us to find the potentials from the empirical two-letter marginals. We considerably extend Stephens and Bialek’s results, applying this formalism to words of up to six letters from the English subset of the recently created Standardized Project Gutenberg Corpus. We find that the model is able to reproduce Zipf’s law, but with some limitations: the general Zipf power-law regime is obtained, but the probabilities of individual words show considerable scatter. In this way, a purely statistical-physics framework is used to describe the probabilities of words. As a by-product, we find that both the empirical two-letter marginal distributions and the interaction-potential distributions follow well-defined statistical laws.
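The maximum-entropy construction described in the abstract can be sketched in miniature. The toy below is a hedged illustration, not the authors' implementation: the two-letter alphabet, the three-letter words, and the six-word "corpus" are all hypothetical, and it uses plain generalized iterative scaling (GIS) rather than the improved iterative-scaling variant the paper employs. It fits pairwise letter potentials to the empirical two-letter marginals and builds the resulting Boltzmann distribution over words.

```python
import itertools
import math
from collections import Counter

ALPHABET = "ab"          # toy alphabet (the paper uses real English text)
L = 3                    # word length (the paper goes up to six letters)
PAIRS = [(i, j) for i in range(L) for j in range(i + 1, L)]
C = len(PAIRS)           # every word activates exactly C features, so GIS applies

# Hypothetical word counts standing in for a real corpus.
corpus = Counter({"aab": 5, "aba": 3, "bab": 2, "abb": 4, "baa": 1, "bbb": 1})
N = sum(corpus.values())
WORDS = ["".join(w) for w in itertools.product(ALPHABET, repeat=L)]

def emp_marginal(i, j, x, y):
    """Empirical two-letter marginal: P(letter i = x, letter j = y)."""
    return sum(c for w, c in corpus.items() if w[i] == x and w[j] == y) / N

# Pairwise potentials (Lagrange multipliers), initialised to zero.
lam = {(i, j, x, y): 0.0
       for (i, j) in PAIRS for x in ALPHABET for y in ALPHABET}

def boltzmann():
    """Boltzmann form: P(w) ∝ exp(-E(w)), with E(w) = -sum of pair potentials."""
    weights = {w: math.exp(sum(lam[(i, j, w[i], w[j])] for (i, j) in PAIRS))
               for w in WORDS}
    Z = sum(weights.values())                 # partition function
    return {w: v / Z for w, v in weights.items()}

# GIS update: lam += (1/C) * log(empirical marginal / model marginal).
for _ in range(500):
    P = boltzmann()
    for (i, j) in PAIRS:
        for x in ALPHABET:
            for y in ALPHABET:
                model = sum(p for w, p in P.items()
                            if w[i] == x and w[j] == y)
                lam[(i, j, x, y)] += math.log(emp_marginal(i, j, x, y) / model) / C

P = boltzmann()  # fitted word probabilities
```

With the potentials fitted, sorting `P` by decreasing probability gives the toy analogue of the rank-frequency curve that the paper compares against Zipf's law.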


Bibliographic Details
Main Authors: Corral, Álvaro; García del Muro, Montserrat
Format: Online Article (Text)
Language: English
Published: Entropy (Basel), MDPI, 2020
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516604/
https://www.ncbi.nlm.nih.gov/pubmed/33285954
http://dx.doi.org/10.3390/e22020179
Journal: Entropy (Basel)
Published online: 5 February 2020
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).