Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution

It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the...

Bibliographic Details
Main Authors:	Pechenick, Eitan Adam; Danforth, Christopher M.; Dodds, Peter Sheridan
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4596490/ https://www.ncbi.nlm.nih.gov/pubmed/26445406 http://dx.doi.org/10.1371/journal.pone.0137041

author	Pechenick, Eitan Adam; Danforth, Christopher M.; Dodds, Peter Sheridan
collection	PubMed
description	It is tempting to treat frequency trends from the Google Books data sets as indicators of the “true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We use information theoretic methods to highlight these dynamics by examining and comparing major contributions via a divergence measure of English data sets between decades in the period 1800–2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts. Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
format	Online Article Text
id	pubmed-4596490
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4596490 2015-10-20 PLoS One Research Article. Public Library of Science 2015-10-07. /pmc/articles/PMC4596490/ /pubmed/26445406 http://dx.doi.org/10.1371/journal.pone.0137041 Text en. © 2015 Pechenick et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4596490/ https://www.ncbi.nlm.nih.gov/pubmed/26445406 http://dx.doi.org/10.1371/journal.pone.0137041
