
Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts

Zipf’s law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf’s law should hold in the most natural way, studying its validity for plain word forms and for the corr...


Bibliographic Details
Main authors: Corral, Álvaro, Boleda, Gemma, Ferrer-i-Cancho, Ramon
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4497678/
https://www.ncbi.nlm.nih.gov/pubmed/26158787
http://dx.doi.org/10.1371/journal.pone.0129031
_version_ 1782380542828740608
author Corral, Álvaro
Boleda, Gemma
Ferrer-i-Cancho, Ramon
author_facet Corral, Álvaro
Boleda, Gemma
Ferrer-i-Cancho, Ramon
author_sort Corral, Álvaro
collection PubMed
description Zipf’s law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf’s law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf’s law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf’s law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation.
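The quantities named in the description (word versus lemma frequencies, a power-law exponent, and a low-frequency cut-off) can be illustrated with a minimal Python sketch. It is not the authors' procedure: the regex tokenizer, the `lemma_of` dictionary, and the function names are hypothetical, and the exponent comes from a standard approximate maximum-likelihood formula for discrete power-law data, gamma ≈ 1 + n / Σ ln(f / (cutoff − 0.5)), applied only to frequencies at or above a cut-off chosen by the caller.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count each lowercase word form (crude regex tokenizer, an assumption)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


def lemma_frequencies(word_counts, lemma_of):
    """Merge word-form counts into lemma counts.

    `lemma_of` is a hypothetical form -> lemma dictionary; forms that are
    missing from it are treated as their own lemma.
    """
    counts = Counter()
    for form, n in word_counts.items():
        counts[lemma_of.get(form, form)] += n
    return counts


def estimate_exponent(frequencies, cutoff):
    """Approximate MLE of a power-law exponent for frequencies >= cutoff.

    Returns (gamma, number of types in the fitted tail). The formula is the
    continuous approximation for discrete data and is only a rough guide
    when the cut-off is small.
    """
    if cutoff < 1:
        raise ValueError("cutoff must be at least 1")
    tail = [f for f in frequencies if f >= cutoff]
    if not tail:
        raise ValueError("no frequencies at or above the cut-off")
    log_sum = sum(math.log(f / (cutoff - 0.5)) for f in tail)
    return 1.0 + len(tail) / log_sum, len(tail)
```

Calling `estimate_exponent(word_counts.values(), cutoff)` and `estimate_exponent(lemma_counts.values(), cutoff)` on the same text gives two exponents that can be compared. Choosing the cut-off itself, which the description treats as a fitted parameter, is outside the scope of this sketch.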
format Online
Article
Text
id pubmed-4497678
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4497678 2015-07-14 Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts Corral, Álvaro Boleda, Gemma Ferrer-i-Cancho, Ramon PLoS One Research Article Zipf’s law is a fundamental paradigm in the statistics of written and spoken natural language as well as in other communication systems. We raise the question of the elementary units for which Zipf’s law should hold in the most natural way, studying its validity for plain word forms and for the corresponding lemma forms. We analyze several long literary texts comprising four languages, with different levels of morphological complexity. In all cases Zipf’s law is fulfilled, in the sense that a power-law distribution of word or lemma frequencies is valid for several orders of magnitude. We investigate the extent to which the word-lemma transformation preserves two parameters of Zipf’s law: the exponent and the low-frequency cut-off. We are not able to demonstrate a strict invariance of the tail, as for a few texts both exponents deviate significantly, but we conclude that the exponents are very similar, despite the remarkable transformation that going from words to lemmas represents, considerably affecting all ranges of frequencies. In contrast, the low-frequency cut-offs are less stable, tending to increase substantially after the transformation. Public Library of Science 2015-07-09 /pmc/articles/PMC4497678/ /pubmed/26158787 http://dx.doi.org/10.1371/journal.pone.0129031 Text en © 2015 Corral et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Corral, Álvaro
Boleda, Gemma
Ferrer-i-Cancho, Ramon
Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
title Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
title_full Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
title_fullStr Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
title_full_unstemmed Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
title_short Zipf’s Law for Word Frequencies: Word Forms versus Lemmas in Long Texts
title_sort zipf’s law for word frequencies: word forms versus lemmas in long texts
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4497678/
https://www.ncbi.nlm.nih.gov/pubmed/26158787
http://dx.doi.org/10.1371/journal.pone.0129031
work_keys_str_mv AT corralalvaro zipfslawforwordfrequencieswordformsversuslemmasinlongtexts
AT boledagemma zipfslawforwordfrequencieswordformsversuslemmasinlongtexts
AT ferrericanchoramon zipfslawforwordfrequencieswordformsversuslemmasinlongtexts