Modeling Statistical Properties of Written Text

Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Oth...

Bibliographic Details
Main authors: Serrano, M. Ángeles; Flammini, Alessandro; Menczer, Filippo
Format: Text
Language: English
Published: Public Library of Science 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2670513/
https://www.ncbi.nlm.nih.gov/pubmed/19401762
http://dx.doi.org/10.1371/journal.pone.0005372
_version_ 1782166317552369664
author Serrano, M. Ángeles
Flammini, Alessandro
Menczer, Filippo
author_facet Serrano, M. Ángeles
Flammini, Alessandro
Menczer, Filippo
author_sort Serrano, M. Ángeles
collection PubMed
description Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the nontrivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
format Text
id pubmed-2670513
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-2670513 2009-04-29 Modeling Statistical Properties of Written Text Serrano, M. Ángeles Flammini, Alessandro Menczer, Filippo PLoS One Research Article Public Library of Science 2009-04-29 /pmc/articles/PMC2670513/ /pubmed/19401762 http://dx.doi.org/10.1371/journal.pone.0005372 Text en Serrano et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Serrano, M. Ángeles
Flammini, Alessandro
Menczer, Filippo
Modeling Statistical Properties of Written Text
title Modeling Statistical Properties of Written Text
title_full Modeling Statistical Properties of Written Text
title_fullStr Modeling Statistical Properties of Written Text
title_full_unstemmed Modeling Statistical Properties of Written Text
title_short Modeling Statistical Properties of Written Text
title_sort modeling statistical properties of written text
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2670513/
https://www.ncbi.nlm.nih.gov/pubmed/19401762
http://dx.doi.org/10.1371/journal.pone.0005372
work_keys_str_mv AT serranomangeles modelingstatisticalpropertiesofwrittentext
AT flamminialessandro modelingstatisticalpropertiesofwrittentext
AT menczerfilippo modelingstatisticalpropertiesofwrittentext