Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora

Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly corr...


Bibliographic Details
Main author: Levshina, Natalia
Format: Online Article Text
Language: English
Published: MDPI 2022
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8870940/
https://www.ncbi.nlm.nih.gov/pubmed/35205578
http://dx.doi.org/10.3390/e24020280
collection PubMed
description Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently, depending on several methodological choices. The present study examines a more diverse sample of languages than the previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given previous word and informativity given next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures for different languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly, by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
id pubmed-8870940
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-8870940 2022-02-25 Levshina, Natalia Entropy (Basel) Article MDPI 2022-02-16 /pmc/articles/PMC8870940/ /pubmed/35205578 http://dx.doi.org/10.3390/e24020280 Text en © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).