Modeling Statistical Properties of Written Text
Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Oth...
Main authors: | Serrano, M. Ángeles; Flammini, Alessandro; Menczer, Filippo |
---|---|
Format: | Text |
Language: | English |
Published: | Public Library of Science, 2009 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2670513/ https://www.ncbi.nlm.nih.gov/pubmed/19401762 http://dx.doi.org/10.1371/journal.pone.0005372 |
_version_ | 1782166317552369664 |
---|---|
author | Serrano, M. Ángeles; Flammini, Alessandro; Menczer, Filippo |
author_facet | Serrano, M. Ángeles; Flammini, Alessandro; Menczer, Filippo |
author_sort | Serrano, M. Ángeles |
collection | PubMed |
description | Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the non trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics. |
format | Text |
id | pubmed-2670513 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-26705132009-04-29 Modeling Statistical Properties of Written Text Serrano, M. Ángeles Flammini, Alessandro Menczer, Filippo PLoS One Research Article Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the non trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics. Public Library of Science 2009-04-29 /pmc/articles/PMC2670513/ /pubmed/19401762 http://dx.doi.org/10.1371/journal.pone.0005372 Text en Serrano et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Serrano, M. Ángeles Flammini, Alessandro Menczer, Filippo Modeling Statistical Properties of Written Text |
title | Modeling Statistical Properties of Written Text |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2670513/ https://www.ncbi.nlm.nih.gov/pubmed/19401762 http://dx.doi.org/10.1371/journal.pone.0005372 |
work_keys_str_mv | AT serranomangeles modelingstatisticalpropertiesofwrittentext AT flamminialessandro modelingstatisticalpropertiesofwrittentext AT menczerfilippo modelingstatisticalpropertiesofwrittentext |