
An integrated clustering and BERT framework for improved topic modeling

Topic modeling is a machine learning technique that is used extensively in Natural Language Processing (NLP) applications to infer topics from unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques that can automatically detect topics from a...


Bibliographic Details
Main Authors: George, Lijimol; Sumathy, P.
Format: Online Article Text
Language: English
Published: Springer Nature Singapore 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10163298/
https://www.ncbi.nlm.nih.gov/pubmed/37256029
http://dx.doi.org/10.1007/s41870-023-01268-w
author George, Lijimol
Sumathy, P.
collection PubMed
description Topic modeling is a machine learning technique that is used extensively in Natural Language Processing (NLP) applications to infer topics from unstructured textual data. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and can automatically detect topics in a large collection of text documents. However, LDA-based topic models alone do not always provide promising results. Clustering is an effective family of unsupervised machine learning algorithms that is used extensively in applications such as information extraction from unstructured textual data and topic modeling. A hybrid model of Bidirectional Encoder Representations from Transformers (BERT) and LDA for topic modeling, with clustering based on dimensionality reduction, has been studied in detail. Because clustering algorithms are computationally complex and their complexity increases with the number of features, dimensionality reduction based on PCA, t-SNE and UMAP is also performed. Finally, a unified clustering-based framework using BERT and LDA is proposed for mining a set of meaningful topics from massive text corpora. Experiments that simulate user input on benchmark datasets are conducted to demonstrate the effectiveness of the cluster-informed topic modeling framework using BERT and LDA. The experimental results show that clustering with dimensionality reduction helps infer more coherent topics, and hence this unified clustering and BERT-LDA based approach can be used effectively for building topic modeling applications.
format Online
Article
Text
id pubmed-10163298
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer Nature Singapore
record_format MEDLINE/PubMed
spelling pubmed-10163298 2023-05-09 An integrated clustering and BERT framework for improved topic modeling George, Lijimol Sumathy, P. Int J Inf Technol Original Research
Springer Nature Singapore 2023-05-06 2023 /pmc/articles/PMC10163298/ /pubmed/37256029 http://dx.doi.org/10.1007/s41870-023-01268-w Text en © The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title An integrated clustering and BERT framework for improved topic modeling
topic Original Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10163298/
https://www.ncbi.nlm.nih.gov/pubmed/37256029
http://dx.doi.org/10.1007/s41870-023-01268-w