Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data

Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA to...

Bibliographic Details
Main authors:	Weisser, Christoph; Gerloff, Christoph; Thielmann, Anton; Python, Andre; Reuter, Arik; Kneib, Thomas; Säfken, Benjamin
Format:	Online Article Text
Language:	English
Published:	Springer Berlin Heidelberg 2022
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10060035/ https://www.ncbi.nlm.nih.gov/pubmed/37223721 http://dx.doi.org/10.1007/s00180-022-01246-z

_version_	1785017021942988800
author	Weisser, Christoph; Gerloff, Christoph; Thielmann, Anton; Python, Andre; Reuter, Arik; Kneib, Thomas; Säfken, Benjamin
author_sort	Weisser, Christoph
collection	PubMed
description	Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
format	Online Article Text
id	pubmed-10060035
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer Berlin Heidelberg
record_format	MEDLINE/PubMed
spelling	pubmed-10060035 2023-03-30 Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data Weisser, Christoph; Gerloff, Christoph; Thielmann, Anton; Python, Andre; Reuter, Arik; Kneib, Thomas; Säfken, Benjamin Comput Stat Original Paper Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model. Springer Berlin Heidelberg 2022-07-09 2023 /pmc/articles/PMC10060035/ /pubmed/37223721 http://dx.doi.org/10.1007/s00180-022-01246-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title	Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10060035/ https://www.ncbi.nlm.nih.gov/pubmed/37223721 http://dx.doi.org/10.1007/s00180-022-01246-z
