
An Empirical Model for n-gram Frequency Distribution in Large Corpora

Bibliographic Details
Main authors:	Silva, Joaquim F.; Cunha, Jose C.
Format:	Online Article Text
Language:	English
Published:	2020
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206297/ http://dx.doi.org/10.1007/978-3-030-47436-2_63

author	Silva, Joaquim F. Cunha, Jose C.
collection	PubMed
description	Statistical multiword extraction methods can benefit from knowledge of the n-gram ([Formula: see text]) frequency distribution in natural language corpora, for indexing and time/space optimization purposes. The appearance of increasingly large corpora raises new challenges for investigating the large-scale behavior of n-gram frequency distributions, not typically emerging in small-scale corpora. We propose an empirical model, based on the assumption of finite n-gram language vocabularies, to estimate the number of distinct n-grams in large corpora, as well as the sizes of the equal-frequency n-gram groups, which occur in the lower frequencies starting from 1. The model was validated for n-grams with [Formula: see text] on a wide range of real corpora in English and French, from 60 million up to 8 billion words. These are full, non-truncated corpus data; that is, their associated frequency data include the entire range of observed n-gram frequencies, from 1 up to the maximum. The model predicts the monotonic growth of the numbers of distinct n-grams until reaching asymptotic plateaux as the corpus size grows to infinity. It also predicts the non-monotonicity of the sizes of the equal-frequency n-gram groups as a function of the corpus size.
format	Online Article Text
id	pubmed-7206297
institution	National Center for Biotechnology Information
language	English
publishDate	2020
record_format	MEDLINE/PubMed
spelling	pubmed-7206297 2020-05-08 An Empirical Model for n-gram Frequency Distribution in Large Corpora Silva, Joaquim F. Cunha, Jose C. Advances in Knowledge Discovery and Data Mining Article 2020-04-17 /pmc/articles/PMC7206297/ http://dx.doi.org/10.1007/978-3-030-47436-2_63 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	An Empirical Model for n-gram Frequency Distribution in Large Corpora
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206297/ http://dx.doi.org/10.1007/978-3-030-47436-2_63
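
The two quantities named in the description, the number of distinct n-grams and the sizes of the equal-frequency n-gram groups, can be illustrated on a toy corpus. The sketch below only computes the observed values that such a model would try to estimate; it is not the authors' empirical model, and the function names `ngram_counts` and `equal_frequency_group_sizes` are hypothetical, chosen for this illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (stored as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def equal_frequency_group_sizes(counts):
    """Map each frequency k to the number of distinct n-grams seen exactly k times."""
    return Counter(counts.values())


# Toy corpus (an assumption for illustration; real corpora hold millions of words).
tokens = "the cat sat on the mat and the cat ran".split()

bigrams = ngram_counts(tokens, 2)
distinct_bigrams = len(bigrams)                      # number of distinct 2-grams
bigram_groups = equal_frequency_group_sizes(bigrams)  # group sizes by frequency

unigrams = ngram_counts(tokens, 1)
unigram_groups = equal_frequency_group_sizes(unigrams)

print("distinct 2-grams:", distinct_bigrams)
print("2-gram group sizes by frequency:", dict(bigram_groups))
print("1-gram group sizes by frequency:", dict(unigram_groups))
```

Summing frequency times group size over all groups recovers the total number of n-gram occurrences in the corpus, which is a quick consistency check on any such table.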
