Rank Diversity of Languages: Generic Behavior in Computational Linguistics

Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find t...

Bibliographic Details
Main authors: Cocho, Germinal, Flores, Jorge, Gershenson, Carlos, Pineda, Carlos, Sánchez, Sergio
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4388647/
https://www.ncbi.nlm.nih.gov/pubmed/25849150
http://dx.doi.org/10.1371/journal.pone.0121898
_version_ 1782365417215361024
author Cocho, Germinal
Flores, Jorge
Gershenson, Carlos
Pineda, Carlos
Sánchez, Sergio
author_facet Cocho, Germinal
Flores, Jorge
Gershenson, Carlos
Pineda, Carlos
Sánchez, Sergio
author_sort Cocho, Germinal
collection PubMed
description Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: “heads” consist of words which almost do not change their rank in time, “bodies” are words of general use, while “tails” are comprised by context-specific words and vary their rank considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied.
format Online
Article
Text
id pubmed-4388647
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-43886472015-04-21 Rank Diversity of Languages: Generic Behavior in Computational Linguistics Cocho, Germinal Flores, Jorge Gershenson, Carlos Pineda, Carlos Sánchez, Sergio PLoS One Research Article Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution rank diversity. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: “heads” consist of words which almost do not change their rank in time, “bodies” are words of general use, while “tails” are comprised by context-specific words and vary their rank considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied. Public Library of Science 2015-04-07 /pmc/articles/PMC4388647/ /pubmed/25849150 http://dx.doi.org/10.1371/journal.pone.0121898 Text en © 2015 Cocho et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Cocho, Germinal
Flores, Jorge
Gershenson, Carlos
Pineda, Carlos
Sánchez, Sergio
Rank Diversity of Languages: Generic Behavior in Computational Linguistics
title Rank Diversity of Languages: Generic Behavior in Computational Linguistics
title_full Rank Diversity of Languages: Generic Behavior in Computational Linguistics
title_fullStr Rank Diversity of Languages: Generic Behavior in Computational Linguistics
title_full_unstemmed Rank Diversity of Languages: Generic Behavior in Computational Linguistics
title_short Rank Diversity of Languages: Generic Behavior in Computational Linguistics
title_sort rank diversity of languages: generic behavior in computational linguistics
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4388647/
https://www.ncbi.nlm.nih.gov/pubmed/25849150
http://dx.doi.org/10.1371/journal.pone.0121898
work_keys_str_mv AT cochogerminal rankdiversityoflanguagesgenericbehaviorincomputationallinguistics
AT floresjorge rankdiversityoflanguagesgenericbehaviorincomputationallinguistics
AT gershensoncarlos rankdiversityoflanguagesgenericbehaviorincomputationallinguistics
AT pinedacarlos rankdiversityoflanguagesgenericbehaviorincomputationallinguistics
AT sanchezsergio rankdiversityoflanguagesgenericbehaviorincomputationallinguistics