Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution

BACKGROUND: Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2,…) is approximately linear when plotted on a double logarithmic scale. It has been...

Bibliographic Details
Main Authors: Ferrer-i-Cancho, Ramon, Elvevåg, Brita
Format: Text
Language: English
Published: Public Library of Science 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2834740/
https://www.ncbi.nlm.nih.gov/pubmed/20231884
http://dx.doi.org/10.1371/journal.pone.0009411
author Ferrer-i-Cancho, Ramon
Elvevåg, Brita
collection PubMed
description BACKGROUND: Zipf's law states that the relationship between the frequency of a word in a text and its rank (the most frequent word has rank 1, the 2nd most frequent word has rank 2,…) is approximately linear when plotted on a double logarithmic scale. It has been argued that the law is not a relevant or useful property of language because simple random texts, constructed by concatenating random characters including blanks behaving as word delimiters, exhibit a Zipf's law-like word rank distribution. METHODOLOGY/PRINCIPAL FINDINGS: In this article, we examine the flaws of such putative good fits of random texts. We demonstrate, by means of three different statistical tests, that ranks derived from random texts and ranks derived from real texts are statistically inconsistent with the parameters employed to argue for such a good fit, even when the parameters are inferred from the target real text. Our findings are valid both for the simplest random texts composed of equally likely characters and for more elaborate and realistic versions where character probabilities are borrowed from a real text. CONCLUSIONS/SIGNIFICANCE: The good fit of random texts to real Zipf's law-like rank distributions has not yet been established. Therefore, we suggest that Zipf's law might in fact be a fundamental law in natural languages.
format Text
id pubmed-2834740
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2834740 2010-03-16 Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution Ferrer-i-Cancho, Ramon Elvevåg, Brita PLoS One Research Article Public Library of Science 2010-03-09 /pmc/articles/PMC2834740/ /pubmed/20231884 http://dx.doi.org/10.1371/journal.pone.0009411 Text en This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. https://creativecommons.org/publicdomain/zero/1.0/
title Random Texts Do Not Exhibit the Real Zipf's Law-Like Rank Distribution
topic Research Article
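
The description above refers to random texts built by concatenating random characters, with blanks acting as word delimiters, and to a rank-frequency relationship that looks roughly linear on a double logarithmic scale. The following is a minimal sketch of that construction for the simplest case of equally likely characters. It is not the authors' code: the function names, the two-letter alphabet, the text length, the seed, and the choice to discard empty words between consecutive blanks are all assumptions made for illustration. The least-squares slope is only a descriptive summary and does not reproduce the three statistical tests mentioned in the description, which the record does not name.

```python
import math
import random
from collections import Counter


def random_text(n_chars, alphabet="abcdefghijklmnopqrstuvwxyz", seed=0):
    """Concatenate n_chars symbols drawn uniformly from alphabet plus a blank.

    Assumption: every letter and the blank are equally likely.
    """
    rng = random.Random(seed)
    symbols = alphabet + " "
    return "".join(rng.choice(symbols) for _ in range(n_chars))


def rank_frequency(text):
    """Return word frequencies in decreasing order (index 0 holds rank 1).

    Assumption: runs of blanks are treated as one delimiter, so empty
    words are discarded.
    """
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Ordinary least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical parameters: a two-letter alphabet and 200,000 characters.
freqs = rank_frequency(random_text(200_000, alphabet="ab"))
slope = loglog_slope(freqs)
print(len(freqs), freqs[:5], round(slope, 2))
```

A negative slope on the log-log scale is what a visual inspection would pick up. The point of the article, as summarised in the description, is that this kind of apparent fit is not sufficient evidence that random texts reproduce the rank distribution of real texts.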