A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far ei...

Bibliographic Details
Main Authors:	Gerlach, Martin; Font-Clos, Francesc
Format:	Online Article Text
Language:	English
Published:	MDPI 2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516435/ https://www.ncbi.nlm.nih.gov/pubmed/33285901 http://dx.doi.org/10.3390/e22010126

_version_	1783587000526307328
author	Gerlach, Martin; Font-Clos, Francesc
author_facet	Gerlach, Martin; Font-Clos, Francesc
author_sort	Gerlach, Martin
collection	PubMed
description	The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than [Formula: see text] word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on three different levels of granularity (raw text, time series of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
format	Online Article Text
id	pubmed-7516435
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7516435 2020-11-09 Entropy (Basel) Article MDPI 2020-01-20 /pmc/articles/PMC7516435/ /pubmed/33285901 http://dx.doi.org/10.3390/e22010126 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Gerlach, Martin Font-Clos, Francesc A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics
title	A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics
title_sort	standardized project gutenberg corpus for statistical analysis of natural language and quantitative linguistics
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516435/ https://www.ncbi.nlm.nih.gov/pubmed/33285901 http://dx.doi.org/10.3390/e22010126
work_keys_str_mv	AT gerlachmartin astandardizedprojectgutenbergcorpusforstatisticalanalysisofnaturallanguageandquantitativelinguistics AT fontclosfrancesc astandardizedprojectgutenbergcorpusforstatisticalanalysisofnaturallanguageandquantitativelinguistics AT gerlachmartin standardizedprojectgutenbergcorpusforstatisticalanalysisofnaturallanguageandquantitativelinguistics AT fontclosfrancesc standardizedprojectgutenbergcorpusforstatisticalanalysisofnaturallanguageandquantitativelinguistics
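
The description above names three levels of granularity for the corpus: raw text, a time series of word tokens, and word counts. As a rough illustration only, the following Python sketch derives the second and third levels from a raw string. The tokenizer, the function names, and the sample sentence are hypothetical and are not the SPGC pipeline, which is documented in the article itself.

```python
import re
from collections import Counter


def tokenize(raw_text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase the text and keep runs of ASCII letters.

    This is an assumption made for illustration; the real corpus may use a
    different tokenization and filtering strategy.
    """
    return re.findall(r"[a-z]+", raw_text.lower())


def word_counts(tokens: list[str]) -> Counter:
    """Collapse the ordered token sequence into per-word frequencies."""
    return Counter(tokens)


# Level 1: raw text (a made-up sample, not taken from any PG book).
raw_text = "The cat sat on the mat. The mat was flat."

# Level 2: ordered sequence ("time series") of word tokens.
tokens = tokenize(raw_text)

# Level 3: counts of words, which discards the ordering.
counts = word_counts(tokens)

print(tokens)
print(counts.most_common(2))
```

Each level discards information relative to the previous one: the token sequence drops punctuation and case under this tokenizer, and the counts drop word order, which is why a corpus may publish all three.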
