An Empirical Model for n-gram Frequency Distribution in Large Corpora
Main authors: | , |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206297/ http://dx.doi.org/10.1007/978-3-030-47436-2_63 |
Summary: | Statistical multiword extraction methods can benefit from knowledge of the n-gram ([Formula: see text]) frequency distribution in natural language corpora, for indexing and time/space optimization purposes. The appearance of increasingly large corpora raises new challenges for investigating the large-scale behavior of n-gram frequency distributions, which does not typically emerge in small-scale corpora. We propose an empirical model, based on the assumption of finite n-gram language vocabularies, to estimate the number of distinct n-grams in large corpora, as well as the sizes of the equal-frequency n-gram groups that occur at the lower frequencies, starting from 1. The model was validated for n-grams with [Formula: see text], on a wide range of real corpora in English and French, from 60 million up to 8 billion words. These are full, non-truncated corpus data; that is, their associated frequency data include the entire range of observed n-gram frequencies, from 1 up to the maximum. The model predicts that the numbers of distinct n-grams grow monotonically, approaching asymptotic plateaux as the corpus size grows to infinity. It also predicts that the sizes of the equal-frequency n-gram groups vary non-monotonically as a function of the corpus size. |
---|
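The two quantities the summary refers to, the number of distinct n-grams and the sizes of the equal-frequency n-gram groups, can be measured directly on a small token list. The following is a minimal sketch only: it is not the model proposed in the paper, it assumes plain whitespace tokenization, and the function names `ngram_counts` and `equal_frequency_group_sizes` are hypothetical names chosen for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram, represented as a tuple of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def equal_frequency_group_sizes(counts):
    """Map each frequency k to the number of distinct n-grams seen exactly k times."""
    return Counter(counts.values())


# Toy corpus (an assumption for illustration, not data from the paper).
tokens = "the cat sat on the mat and the cat ran".split()

bigrams = ngram_counts(tokens, 2)
distinct = len(bigrams)                        # number of distinct 2-grams
groups = equal_frequency_group_sizes(bigrams)  # group sizes by frequency

print(distinct)
print(sorted(groups.items()))
```

On a real corpus, the same counts taken at increasing corpus sizes would give the observed curves that a model of this kind is meant to estimate.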