Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy

Topic modeling is a popular technique for clustering large collections of text documents, and a variety of regularization types are implemented in topic models. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, the approach is inspired by concepts from statistical physics, where the inferred topical structure of a collection can be considered an informational statistical system residing in a non-equilibrium state. We test the approach on four models: Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA). First, we show that the minimum of Renyi entropy coincides with the "true" number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach to topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the entropy minimum away from the optimal topic number, an effect that is not observed for the hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which calls for further research.
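The entropy-minimum criterion described in the abstract can be illustrated in a few lines of Python. The sketch below is not the authors' code: it reconstructs, from their earlier work on entropic analysis of topic models, one plausible way to score a fitted topic-word matrix with Renyi entropy and sweep the number of topics. The 1/W threshold, the density-of-states ratio, and the free-energy combination F = E - T*S with deformation parameter q = 1/T are assumptions of this reconstruction and should be checked against the paper; scikit-learn's variational LDA and the 20 Newsgroups corpus are stand-ins for the models and labelled collections used in the study.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def renyi_entropy(phi):
    """Renyi entropy of a fitted topic solution.

    phi: array of shape (T, W); each row is one topic's distribution over words.
    The thresholding and free-energy combination below are a reconstruction of
    the authors' entropic framework, not their published code; verify against
    the paper before relying on the exact form.
    """
    T, W = phi.shape
    mask = phi > 1.0 / W                  # keep only "informative" word-topic pairs
    p_tilde = phi[mask].sum() / T         # above-threshold probability mass, normalized by T
    rho = mask.sum() / (W * T)            # density of states: share of informative pairs
    energy = -np.log(p_tilde)             # E = -ln(P~)
    entropy = np.log(rho)                 # Gibbs-Shannon analogue S = ln(rho)
    free_energy = energy - T * entropy    # F = E - T*S, with T acting as temperature (q = 1/T)
    return free_energy / (T - 1)          # Renyi entropy for q = 1/T (assumed form)

# Stand-in corpus: 20 Newsgroups instead of the paper's labelled collections.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]
X = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs)

entropies = {}
for n_topics in range(2, 31, 2):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
    # components_ holds unnormalized variational parameters; normalize rows to get phi.
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    entropies[n_topics] = renyi_entropy(phi)

print("entropy minimum at T =", min(entropies, key=entropies.get))

The same loop can probe regularization instead of topic number: fix n_components and sweep doc_topic_prior and topic_word_prior (scikit-learn's Dirichlet hyper-parameters for variational LDA) to see whether, and how far, the entropy minimum moves.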


Bibliographic Details
Main Authors: Koltcov, Sergei; Ignatenko, Vera; Boukhers, Zeyd; Staab, Steffen
Journal: Entropy (Basel)
Format: Online Article (Text)
Language: English
Published: MDPI, 30 March 2020
Subjects: Article
License: © 2020 by the authors. Licensee MDPI, Basel, Switzerland. Open access under the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Online Access:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516868/
https://www.ncbi.nlm.nih.gov/pubmed/33286169
http://dx.doi.org/10.3390/e22040394