
Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion

In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics—such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM’s model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
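For reference only (the record does not reproduce the paper's customized criterion), the standard Bayesian information criterion that the paper's objective builds on trades goodness of fit against model complexity:

\mathrm{BIC} = -2\,\ln\hat{L} + k\,\ln n

where \hat{L} is the maximized likelihood, k is the number of free model parameters, and n is the number of observations; minimizing BIC favors models that explain the data well with few parameters. The paper derives a customized version of this criterion (adding coding costs for switch parameters); its exact form is given in the article itself.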

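The joint switch update described in the abstract can be sketched as follows. This is a hypothetical illustration under an assumed placeholder BIC-style cost function, not the authors' implementation: for a given word, all on/off switches across topics are scored together, and the jointly cheapest configuration is kept, instead of flipping one switch at a time.

import itertools

def joint_switch_update(cost_fn, num_topics):
    """For one word, score every on/off switch configuration across all
    topics jointly and keep the cheapest, rather than updating switches
    one topic at a time (which can get stuck in poor local optima).
    cost_fn stands in for a BIC-style objective (fit + complexity cost);
    exhaustive enumeration is exponential in num_topics, so this sketch
    only illustrates the idea for small topic counts."""
    return min(itertools.product((0, 1), repeat=num_topics), key=cost_fn)

# Hypothetical usage: a toy cost charging 1.0 per active switch (complexity)
# and rewarding 1.5 of "fit" for each of the first two topics turned on.
toy_cost = lambda cfg: sum(cfg) * 1.0 - (cfg[0] + cfg[1]) * 1.5
print(joint_switch_update(toy_cost, num_topics=4))  # -> (1, 1, 0, 0)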

Bibliographic Details
Main Authors: Wang, Hang; Miller, David
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516783/
https://www.ncbi.nlm.nih.gov/pubmed/33286100
http://dx.doi.org/10.3390/e22030326
author Wang, Hang
Miller, David
author_facet Wang, Hang
Miller, David
author_sort Wang, Hang
collection PubMed
description In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics—such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM’s model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
format Online
Article
Text
id pubmed-7516783
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7516783 2020-11-09 Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion Wang, Hang Miller, David Entropy (Basel) Article In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics—such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM’s model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM. MDPI 2020-03-12 /pmc/articles/PMC7516783/ /pubmed/33286100 http://dx.doi.org/10.3390/e22030326 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Wang, Hang
Miller, David
Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_full Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_fullStr Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_full_unstemmed Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_short Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_sort improved parsimonious topic modeling based on the bayesian information criterion
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516783/
https://www.ncbi.nlm.nih.gov/pubmed/33286100
http://dx.doi.org/10.3390/e22030326
work_keys_str_mv AT wanghang improvedparsimonioustopicmodelingbasedonthebayesianinformationcriterion
AT millerdavid improvedparsimonioustopicmodelingbasedonthebayesianinformationcriterion