
Analysis and tuning of hierarchical topic models based on Renyi entropy approach

Hierarchical topic modeling is a potentially powerful instrument for determining the topical structure of text collections that additionally allows constructing a hierarchy representing the levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of the hierarchy, remains a challenging task. In this paper, we propose an approach based on Renyi entropy as a partial solution to the above problem. First, we introduce a Renyi entropy-based metric of quality for hierarchical models. Second, we propose a practical approach to obtaining the "correct" number of topics in hierarchical topic models and show how model hyperparameters should be tuned for that purpose. We test this approach on datasets with a known number of topics, as determined by human mark-up, three of these datasets being in English and one in Russian. In the numerical experiments, we consider three different hierarchical models: the hierarchical latent Dirichlet allocation model (hLDA), the hierarchical Pachinko allocation model (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that the hLDA model possesses a significant level of instability and, moreover, that the derived numbers of topics are far from the true numbers for the labeled datasets. For the hPAM model, the Renyi entropy approach allows determining only one level of the data structure. For the hARTM model, the proposed approach allows us to estimate the number of topics for two levels of the hierarchy.

Bibliographic Details
Main Authors: Koltcov, Sergei; Ignatenko, Vera; Terpilovskii, Maxim; Rosso, Paolo
Format: Online Article (Text)
Language: English
Journal: PeerJ Comput Sci
Section: Data Mining and Machine Learning
Published: PeerJ Inc., 2021-07-29
Online Access:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8330431/
https://www.ncbi.nlm.nih.gov/pubmed/34401473
http://dx.doi.org/10.7717/peerj-cs.608
© 2021 Koltcov et al.
This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.