
Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts

With the rapid proliferation of social networking sites (SNS), automatic topic extraction from various text messages posted on SNS is becoming an important source of information for understanding current social trends or needs. Latent Dirichlet Allocation (LDA), a probabilistic generative model, is...


Bibliographic Details
Main Authors: Murakami, Riki; Chakraborty, Basabi
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8840106/
https://www.ncbi.nlm.nih.gov/pubmed/35161598
http://dx.doi.org/10.3390/s22030852
_version_ 1784650536642936832
author Murakami, Riki
Chakraborty, Basabi
author_facet Murakami, Riki
Chakraborty, Basabi
author_sort Murakami, Riki
collection PubMed
description With the rapid proliferation of social networking sites (SNS), automatic topic extraction from various text messages posted on SNS is becoming an important source of information for understanding current social trends or needs. Latent Dirichlet Allocation (LDA), a probabilistic generative model, is one of the most popular topic models in the area of Natural Language Processing (NLP) and has been widely used in information retrieval, topic extraction, and document analysis. Unlike long texts from formal documents, messages on SNS are generally short. Traditional topic models such as LDA or pLSA (probabilistic latent semantic analysis) suffer performance degradation in short-text analysis due to a lack of word co-occurrence information in each short text. To cope with this problem, various techniques are evolving for interpretable topic modeling for short texts; combining pretrained word embedding from an external corpus with topic models is one of them. Owing to recent developments in deep neural networks (DNN) and deep generative models, neural-topic models (NTM) are emerging that achieve flexibility and high performance in topic modeling. However, there are very few research works on neural-topic models with pretrained word embedding for generating high-quality topics from short texts. In this work, in addition to pretrained word embedding, a fine-tuning stage with the original corpus is proposed for training neural-topic models in order to generate semantically coherent, corpus-specific topics. An extensive study of eight neural-topic models has been carried out, through simulation experiments with several benchmark datasets, to check the effectiveness of the additional fine-tuning and pretrained word embedding in generating interpretable topics. The extracted topics are evaluated by different metrics of topic coherence and topic diversity. We have also studied the performance of the models in classification and clustering tasks. Our study concludes that though auxiliary word embedding with a large external corpus improves the topic coherence of short texts, an additional fine-tuning stage is needed for generating more corpus-specific topics from short-text data.
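The abstract evaluates extracted topics with topic coherence and topic diversity. As a minimal illustrative sketch (not the authors' code), the snippet below scores a set of topic word lists with gensim's C_v coherence and a simple top-word diversity ratio; the toy texts, topic lists, and variable names are assumptions for illustration only.

```python
# Sketch: scoring topic quality with coherence (gensim C_v) and diversity.
# The tiny corpus and topics below are placeholders, not the paper's data.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Tokenized short texts (in the paper these would be the benchmark corpora).
texts = [
    ["neural", "topic", "model", "short", "text"],
    ["word", "embedding", "pretrained", "corpus"],
    ["topic", "coherence", "diversity", "evaluation"],
]
dictionary = Dictionary(texts)

# Top-k words per topic, as produced by any topic model (LDA or an NTM).
topics = [
    ["topic", "model", "neural", "text"],
    ["embedding", "word", "pretrained", "corpus"],
]

# Topic coherence: higher C_v is commonly read as more interpretable topics.
coherence = CoherenceModel(
    topics=topics, texts=texts, dictionary=dictionary, coherence="c_v"
).get_coherence()

# Topic diversity: share of unique words across all topics' top-k lists;
# values near 1 mean the topics overlap little.
all_words = [w for t in topics for w in t]
diversity = len(set(all_words)) / len(all_words)

print(f"C_v coherence: {coherence:.3f}, topic diversity: {diversity:.3f}")
```

Higher coherence together with higher diversity is the usual reading of an interpretable, non-redundant topic set, which is how the study compares the eight neural-topic models.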
format Online
Article
Text
id pubmed-8840106
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8840106 2022-02-13 Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts Murakami, Riki Chakraborty, Basabi Sensors (Basel) Article With the rapid proliferation of social networking sites (SNS), automatic topic extraction from various text messages posted on SNS is becoming an important source of information for understanding current social trends or needs. Latent Dirichlet Allocation (LDA), a probabilistic generative model, is one of the most popular topic models in the area of Natural Language Processing (NLP) and has been widely used in information retrieval, topic extraction, and document analysis. Unlike long texts from formal documents, messages on SNS are generally short. Traditional topic models such as LDA or pLSA (probabilistic latent semantic analysis) suffer performance degradation in short-text analysis due to a lack of word co-occurrence information in each short text. To cope with this problem, various techniques are evolving for interpretable topic modeling for short texts; combining pretrained word embedding from an external corpus with topic models is one of them. Owing to recent developments in deep neural networks (DNN) and deep generative models, neural-topic models (NTM) are emerging that achieve flexibility and high performance in topic modeling. However, there are very few research works on neural-topic models with pretrained word embedding for generating high-quality topics from short texts. In this work, in addition to pretrained word embedding, a fine-tuning stage with the original corpus is proposed for training neural-topic models in order to generate semantically coherent, corpus-specific topics. An extensive study of eight neural-topic models has been carried out, through simulation experiments with several benchmark datasets, to check the effectiveness of the additional fine-tuning and pretrained word embedding in generating interpretable topics. The extracted topics are evaluated by different metrics of topic coherence and topic diversity. We have also studied the performance of the models in classification and clustering tasks. Our study concludes that though auxiliary word embedding with a large external corpus improves the topic coherence of short texts, an additional fine-tuning stage is needed for generating more corpus-specific topics from short-text data. MDPI 2022-01-23 /pmc/articles/PMC8840106/ /pubmed/35161598 http://dx.doi.org/10.3390/s22030852 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Murakami, Riki
Chakraborty, Basabi
Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts
title Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts
title_full Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts
title_fullStr Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts
title_full_unstemmed Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts
title_short Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts
title_sort investigating the efficient use of word embedding with neural-topic models for interpretable topics from short texts
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8840106/
https://www.ncbi.nlm.nih.gov/pubmed/35161598
http://dx.doi.org/10.3390/s22030852
work_keys_str_mv AT murakamiriki investigatingtheefficientuseofwordembeddingwithneuraltopicmodelsforinterpretabletopicsfromshorttexts
AT chakrabortybasabi investigatingtheefficientuseofwordembeddingwithneuraltopicmodelsforinterpretabletopicsfromshorttexts