The natural selection of words: Finding the features of fitness

Bibliographic Details
Main Authors:	Turney, Peter D.; Mohammad, Saif M.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6349325/ https://www.ncbi.nlm.nih.gov/pubmed/30689665 http://dx.doi.org/10.1371/journal.pone.0211512

Collection:	PubMed
Record ID:	pubmed-6349325
Institution:	National Center for Biotechnology Information
Description:	We introduce a dataset for studying the evolution of words, constructed from WordNet and the Google Books Ngram Corpus. The dataset tracks the evolution of 4,000 synonym sets (synsets), containing 9,000 English words, from 1800 AD to 2000 AD. We present a supervised learning algorithm that is able to predict the future leader of a synset: the word in the synset that will have the highest frequency. The algorithm uses features based on a word’s length, the characters in the word, and the historical frequencies of the word. It can predict change of leadership (including the identity of the new leader) fifty years in the future, with an F-score considerably above random guessing. Analysis of the learned models provides insight into the causes of change in the leader of a synset. The algorithm confirms observations linguists have made, such as the trend to replace the -ise suffix with -ize, the rivalry between the -ity and -ness suffixes, and the struggle between economy (shorter words are easier to remember and to write) and clarity (longer words are more distinctive and less likely to be confused with one another). The results indicate that integration of the Google Books Ngram Corpus with WordNet has significant potential for improving our understanding of how language evolves.
Journal:	PLoS One
Publication Date:	2019-01-28
License:	© 2019 Turney, Mohammad. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.