Renormalization Analysis of Topic Models
In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process…
Main Authors: | Koltcov, Sergei; Ignatenko, Vera |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2020 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7517079/ https://www.ncbi.nlm.nih.gov/pubmed/33286328 http://dx.doi.org/10.3390/e22050556 |
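The abstract above describes choosing the number of topics by minimizing a Rényi entropy of the topic-model output over a grid of candidate topic numbers. The sketch below is a minimal illustration of that idea only: it uses the standard order-α Rényi entropy of the flattened word-topic matrix and scikit-learn's LDA, not the paper's own entropy functional, and all names (`alpha`, `candidate_T`, `docs`) are illustrative assumptions.

```python
# Minimal sketch (assumed formulation, not the authors' exact functional):
# fit LDA for each candidate number of topics and score the word-topic matrix
# with a standard Renyi entropy; the minimum-entropy T approximates the optimum.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def renyi_entropy(p, alpha=2.0):
    """Standard Renyi entropy of order alpha for a probability vector p."""
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def entropy_for_T(counts, T, alpha=2.0, seed=0):
    """Fit an LDA model with T topics and score its word-topic matrix."""
    lda = LatentDirichletAllocation(n_components=T, random_state=seed)
    lda.fit(counts)
    # Normalize pseudo-counts into per-topic word distributions (T x W).
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    # Flatten all topics into one distribution before scoring.
    return renyi_entropy(phi.ravel() / T, alpha=alpha)

docs = ["first toy document", "second toy document about topics"]  # placeholder corpus
counts = CountVectorizer().fit_transform(docs)
candidate_T = range(2, 6)
scores = {T: entropy_for_T(counts, T) for T in candidate_T}
best_T = min(scores, key=scores.get)  # minimum-entropy heuristic for the optimal T
```

Scanning `candidate_T` this way is the slow grid-search baseline; the renormalization procedure described in the abstract is intended to approximate the same entropy curve much faster (a sketch of that idea follows the full record below).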
_version_ | 1783587147495768064 |
---|---|
author | Koltcov, Sergei Ignatenko, Vera |
author_facet | Koltcov, Sergei Ignatenko, Vera |
author_sort | Koltcov, Sergei |
collection | PubMed |
description | In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic modeling demonstrates self-similar behavior under variation of the number of clusters. Such behavior allows using a renormalization technique. A combination of renormalization procedure with the Renyi entropy approach allows for quick searching of the optimal number of topics. In this paper, the renormalization procedure is developed for the probabilistic Latent Semantic Analysis (pLSA), and the Latent Dirichlet Allocation model with variational Expectation–Maximization algorithm (VLDA) and the Latent Dirichlet Allocation model with granulated Gibbs sampling procedure (GLDA). The experiments were conducted on two test datasets with a known number of topics in two different languages and on one unlabeled test dataset with an unknown number of topics. The paper shows that the renormalization procedure allows for finding an approximation of the optimal number of topics at least 30 times faster than the grid search without significant loss of quality. |
format | Online Article Text |
id | pubmed-7517079 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-75170792020-11-09 Renormalization Analysis of Topic Models Koltcov, Sergei Ignatenko, Vera Entropy (Basel) Article In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic modeling demonstrates self-similar behavior under variation of the number of clusters. Such behavior allows using a renormalization technique. A combination of renormalization procedure with the Renyi entropy approach allows for quick searching of the optimal number of topics. In this paper, the renormalization procedure is developed for the probabilistic Latent Semantic Analysis (pLSA), and the Latent Dirichlet Allocation model with variational Expectation–Maximization algorithm (VLDA) and the Latent Dirichlet Allocation model with granulated Gibbs sampling procedure (GLDA). The experiments were conducted on two test datasets with a known number of topics in two different languages and on one unlabeled test dataset with an unknown number of topics. The paper shows that the renormalization procedure allows for finding an approximation of the optimal number of topics at least 30 times faster than the grid search without significant loss of quality. MDPI 2020-05-16 /pmc/articles/PMC7517079/ /pubmed/33286328 http://dx.doi.org/10.3390/e22050556 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Koltcov, Sergei Ignatenko, Vera Renormalization Analysis of Topic Models |
title | Renormalization Analysis of Topic Models |
title_full | Renormalization Analysis of Topic Models |
title_fullStr | Renormalization Analysis of Topic Models |
title_full_unstemmed | Renormalization Analysis of Topic Models |
title_short | Renormalization Analysis of Topic Models |
title_sort | renormalization analysis of topic models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7517079/ https://www.ncbi.nlm.nih.gov/pubmed/33286328 http://dx.doi.org/10.3390/e22050556 |
work_keys_str_mv | AT koltcovsergei renormalizationanalysisoftopicmodels AT ignatenkovera renormalizationanalysisoftopicmodels |
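For completeness, a hypothetical sketch of the renormalization idea itself: fit one model with a deliberately large number of topics once, then repeatedly merge topic pairs and rescore, instead of refitting a model for every candidate T. The merging rule below (closest pair by symmetric KL divergence) and the simple averaging step are assumptions for illustration, not the authors' exact procedure.

```python
# Sketch of renormalization by topic merging (illustrative assumptions):
# start from a word-topic matrix phi with many topics and coarse-grain it
# step by step, scoring the model after each merge.

import numpy as np
from scipy.special import rel_entr

def symmetric_kl(p, q):
    """Symmetric KL divergence between two word distributions."""
    return rel_entr(p, q).sum() + rel_entr(q, p).sum()

def merge_closest(phi):
    """Merge the two most similar rows of a T x W word-topic matrix."""
    T = phi.shape[0]
    pairs = [(symmetric_kl(phi[i], phi[j]), i, j)
             for i in range(T) for j in range(i + 1, T)]
    _, i, j = min(pairs)
    merged = (phi[i] + phi[j]) / 2.0            # renormalized merged topic
    rest = np.delete(phi, [i, j], axis=0)
    return np.vstack([rest, merged])

def renormalization_curve(phi, score, min_T=2):
    """Score the model after each merge, from the initial T down to min_T."""
    curve = {}
    while phi.shape[0] >= min_T:
        curve[phi.shape[0]] = score(phi)
        if phi.shape[0] == min_T:
            break
        phi = merge_closest(phi)
    return curve  # candidate T -> entropy; its minimum approximates the optimum
```

Here `score` can be the Rényi-based scorer from the earlier sketch, e.g. `lambda phi: renyi_entropy(phi.ravel() / phi.shape[0])`. Because only one model is ever fitted, scanning the returned curve for its minimum replaces the per-T refits of the grid search, which is presumably the source of the speed-up reported in the abstract.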