
Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion

In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics—such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM’s model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
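For reference only (the record does not reproduce the paper's customized criterion), the standard Bayesian information criterion that the paper's objective builds on trades goodness of fit against model complexity:

\mathrm{BIC} = -2\,\ln\hat{L} + k\,\ln n

where \hat{L} is the maximized likelihood, k is the number of free model parameters, and n is the number of observations; minimizing BIC favors models that explain the data well with few parameters. The paper derives a customized version of this criterion (adding coding costs for switch parameters); its exact form is given in the article itself.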

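The joint switch update described in the abstract can be sketched as follows. This is a hypothetical illustration under an assumed placeholder BIC-style cost function, not the authors' implementation: for a given word, all on/off switches across topics are scored together, and the jointly cheapest configuration is kept, instead of flipping one switch at a time.

import itertools

def joint_switch_update(cost_fn, num_topics):
    """For one word, score every on/off switch configuration across all
    topics jointly and keep the cheapest, rather than updating switches
    one topic at a time (which can get stuck in poor local optima).
    cost_fn stands in for a BIC-style objective (fit + complexity cost);
    exhaustive enumeration is exponential in num_topics, so this sketch
    only illustrates the idea for small topic counts."""
    return min(itertools.product((0, 1), repeat=num_topics), key=cost_fn)

# Hypothetical usage: a toy cost charging 1.0 per active switch (complexity)
# and rewarding 1.5 of "fit" for each of the first two topics turned on.
toy_cost = lambda cfg: sum(cfg) * 1.0 - (cfg[0] + cfg[1]) * 1.5
print(joint_switch_update(toy_cost, num_topics=4))  # -> (1, 1, 0, 0)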

Bibliographic Details
Main Authors: Wang, Hang; Miller, David
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516783/
https://www.ncbi.nlm.nih.gov/pubmed/33286100
http://dx.doi.org/10.3390/e22030326
author Wang, Hang
Miller, David
author_facet Wang, Hang
Miller, David
author_sort Wang, Hang
collection PubMed
description In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics—such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM’s model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM.
format Online
Article
Text
id pubmed-7516783
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7516783 2020-11-09 Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion Wang, Hang Miller, David Entropy (Basel) Article In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the modeling determined a subset of salient words for each topic, with topic-specific probabilities, with the rest of the words in the dictionary explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model—the topic-specific words, document-specific topics, all model parameter values, and the total number of topics—in a wholly unsupervised fashion. In the present work, several important modeling and algorithm (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics—such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM’s model sparsity, which also allows model selection of more topics and with lower BIC cost than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM, which in principle should be less susceptible to finding poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM model with respect to multiple performance measures, and gave a sparser topic model representation than the original PTM. MDPI 2020-03-12 /pmc/articles/PMC7516783/ /pubmed/33286100 http://dx.doi.org/10.3390/e22030326 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Wang, Hang
Miller, David
Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_full Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_fullStr Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_full_unstemmed Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_short Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion
title_sort improved parsimonious topic modeling based on the bayesian information criterion
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516783/
https://www.ncbi.nlm.nih.gov/pubmed/33286100
http://dx.doi.org/10.3390/e22030326
work_keys_str_mv AT wanghang improvedparsimonioustopicmodelingbasedonthebayesianinformationcriterion
AT millerdavid improvedparsimonioustopicmodelingbasedonthebayesianinformationcriterion