Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy
Main Authors: | Koltcov, Sergei; Ignatenko, Vera; Boukhers, Zeyd; Staab, Steffen |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2020 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516868/ https://www.ncbi.nlm.nih.gov/pubmed/33286169 http://dx.doi.org/10.3390/e22040394 |
_version_ | 1783587098215841792 |
---|---|
author | Koltcov, Sergei Ignatenko, Vera Boukhers, Zeyd Staab, Steffen |
author_facet | Koltcov, Sergei Ignatenko, Vera Boukhers, Zeyd Staab, Steffen |
author_sort | Koltcov, Sergei |
collection | PubMed |
description | Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models—Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA)—we first show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. Simultaneously, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy away from the topic number optimum, an effect that is not observed for the hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, a phenomenon that needs further research. |
format | Online Article Text |
id | pubmed-7516868 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-7516868 2020-11-09 Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy Koltcov, Sergei Ignatenko, Vera Boukhers, Zeyd Staab, Steffen Entropy (Basel) Article Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models—Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA)—we first show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. Simultaneously, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy away from the topic number optimum, an effect that is not observed for the hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, a phenomenon that needs further research. MDPI 2020-03-30 /pmc/articles/PMC7516868/ /pubmed/33286169 http://dx.doi.org/10.3390/e22040394 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Koltcov, Sergei Ignatenko, Vera Boukhers, Zeyd Staab, Steffen Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy |
title | Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy |
title_full | Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy |
title_fullStr | Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy |
title_full_unstemmed | Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy |
title_short | Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy |
title_sort | analyzing the influence of hyper-parameters and regularizers of topic modeling in terms of renyi entropy |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516868/ https://www.ncbi.nlm.nih.gov/pubmed/33286169 http://dx.doi.org/10.3390/e22040394 |
work_keys_str_mv | AT koltcovsergei analyzingtheinfluenceofhyperparametersandregularizersoftopicmodelingintermsofrenyientropy AT ignatenkovera analyzingtheinfluenceofhyperparametersandregularizersoftopicmodelingintermsofrenyientropy AT boukherszeyd analyzingtheinfluenceofhyperparametersandregularizersoftopicmodelingintermsofrenyientropy AT staabsteffen analyzingtheinfluenceofhyperparametersandregularizersoftopicmodelingintermsofrenyientropy |
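
For readers who want to experiment with the entropy-based model selection described in the abstract above, here is a minimal sketch of the Renyi entropy computation, following the statistical-physics formulation used in Koltcov's earlier work on entropic topic-model tuning as I understand it: internal energy from the above-threshold word probabilities, Gibbs-Shannon entropy from the density of such words, and Renyi entropy as free energy divided by T - 1. The function name, the dense word-topic matrix input, and the exact variable conventions are illustrative assumptions, not the authors' published code.

```python
import numpy as np

def renyi_entropy(phi: np.ndarray) -> float:
    """Sketch of the Renyi entropy of a fitted topic model.

    phi : (W, T) word-topic probability matrix (each column sums to 1).
    Words with probability above 1/W are treated as informative; their
    count gives the density of states and their total probability mass
    gives the internal energy.
    """
    w, t = phi.shape
    if t < 2:
        raise ValueError("Renyi entropy needs at least two topics")
    mask = phi > 1.0 / w                    # informative word-topic entries
    n = mask.sum()                          # number of entries above threshold
    if n == 0:
        raise ValueError("no word probabilities exceed 1/W")
    p_tilde = phi[mask].sum() / t           # normalized above-threshold mass
    energy = -np.log(p_tilde)               # internal energy E = -ln(P~)
    entropy_gs = np.log(n / (w * t))        # Gibbs-Shannon entropy S = ln(rho)
    free_energy = energy - t * entropy_gs   # free energy F = E - T * S
    return float(free_energy / (t - 1))     # Renyi entropy ~ F / (T - 1)


# Hypothetical usage: fit models over a range of topic numbers and pick
# the T at which the Renyi entropy curve reaches its minimum
# (fit_model is a placeholder for any trainer returning a phi matrix).
# entropies = {t: renyi_entropy(fit_model(corpus, t)) for t in range(2, 51)}
# best_t = min(entropies, key=entropies.get)
```

According to the abstract, this entropy minimum coincides with the human-labelled topic number for pLSA, BigARTM, and both LDA variants, but shifts away from it when BigARTM's regularization coefficient is large.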