
Long-Range Memory in Literary Texts: On the Universal Clustering of the Rare Words

A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf’s law). Here we address the complementary question of whether the rhythm of the text, characteri...


Bibliographic Details
Main authors: Tanaka-Ishii, Kumiko; Bunde, Armin
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5125566/
https://www.ncbi.nlm.nih.gov/pubmed/27893737
http://dx.doi.org/10.1371/journal.pone.0164658
author Tanaka-Ishii, Kumiko
Bunde, Armin
collection PubMed
description A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf’s law). Here we address the complementary question of whether the rhythm of the text, characterized by the arrangement of the rare words in the text, can also be quantified mathematically in a similarly basic way. To this end, we consider representative classic single-authored texts from England/Ireland, France, Germany, China, and Japan. In each text, we classify each word by its rank. We focus on the rare words with ranks above some threshold Q and study the lengths of the (return) intervals between them. We find that, for all texts considered, the probability S_Q(r) that the length of an interval exceeds r follows a perfect Weibull function, S_Q(r) = exp(−b(β) r^β), with β around 0.7. The return intervals themselves are arranged in a long-range correlated, self-similar fashion, where the autocorrelation function C_Q(s) of the intervals follows a power law, C_Q(s) ∼ s^(−γ), with an exponent γ between 0.14 and 0.48. We show that these features lead to a pronounced clustering of the rare words in the text.
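The procedure outlined in the description (rank the words, keep those with rank above Q, measure the gaps between them) could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the whitespace tokenization in the usage note, the tie-breaking of equal-frequency words by first appearance, and the particular normalization of the autocorrelation are all assumptions made for the example.

```python
from collections import Counter


def return_intervals(tokens, q):
    """Gaps between successive rare words (frequency rank > q)."""
    counts = Counter(tokens)
    # Rank 1 is the most frequent word. Equal counts keep first-seen
    # order, which is an assumption of this sketch.
    rank = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    positions = [i for i, w in enumerate(tokens) if rank[w] > q]
    return [b - a for a, b in zip(positions, positions[1:])]


def survival(intervals, r):
    """Empirical S_Q(r): fraction of intervals whose length exceeds r."""
    return sum(1 for x in intervals if x > r) / len(intervals)


def autocorrelation(intervals, s):
    """Sample autocorrelation of the interval sequence at lag s.

    Assumed normalization: lagged covariance divided by the variance.
    Undefined (division by zero) if all intervals are equal.
    """
    n = len(intervals)
    mean = sum(intervals) / n
    var = sum((x - mean) ** 2 for x in intervals) / n
    cov = sum(
        (intervals[i] - mean) * (intervals[i + s] - mean)
        for i in range(n - s)
    ) / (n - s)
    return cov / var
```

For a real text one might call `return_intervals(text.split(), q)` for some threshold `q`, then compare `survival` against exp(−b r^β) and `autocorrelation` against s^(−γ) on logarithmic axes. Fitting β and γ is left out of this sketch.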
format Online
Article
Text
id pubmed-5125566
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5125566 2016-12-15 Long-Range Memory in Literary Texts: On the Universal Clustering of the Rare Words Tanaka-Ishii, Kumiko Bunde, Armin PLoS One Research Article Public Library of Science 2016-11-28 /pmc/articles/PMC5125566/ /pubmed/27893737 http://dx.doi.org/10.1371/journal.pone.0164658 Text en © 2016 Tanaka-Ishii, Bunde http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Long-Range Memory in Literary Texts: On the Universal Clustering of the Rare Words
topic Research Article