Web content topic modeling using LDA and HTML tags

An immense volume of digital documents exists online and offline with content that can offer useful information and insights. Utilizing topic modeling enhances the analysis and understanding of digital documents. Topic modeling discovers latent semantic structures or topics within a set of digital t...

Bibliographic Details
Main authors: Altarturi, Hamza H.M., Saadoon, Muntadher, Anuar, Nor Badrul
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2023
Subjects: Data Mining and Machine Learning
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403181/
https://www.ncbi.nlm.nih.gov/pubmed/37547394
http://dx.doi.org/10.7717/peerj-cs.1459
author Altarturi, Hamza H.M.
Saadoon, Muntadher
Anuar, Nor Badrul
collection PubMed
description An immense volume of digital documents exists online and offline, with content that can offer useful information and insights. Topic modeling enhances the analysis and understanding of these documents by discovering latent semantic structures, or topics, within a set of digital textual documents. Internet of Things, blockchain, recommender system, and search engine optimization applications use topic modeling to handle data mining tasks such as classification and clustering. The usefulness of a topic model depends on the quality of the term patterns and topics it produces, and topic coherence is the standard metric for measuring that quality. Previous studies built topic models to work on conventional documents in general; these models underperform when applied to web content data because conventional and HTML documents differ in structure. Neglecting the unique structure of web content leads to missing otherwise coherent topics and, therefore, to low topic quality. This study proposes a topic model that learns coherent topics in web content data. We present the HTML Topic Model (HTM), a web content topic model that takes HTML tags into consideration to understand the structure of web pages. We conducted two series of experiments to demonstrate the limitations of existing topic models and to examine the topic coherence of the HTM against the widely used Latent Dirichlet Allocation (LDA) model and its variants, namely the Correlated Topic Model, the Dirichlet Multinomial Regression, the Hierarchical Dirichlet Process, the Hierarchical Latent Dirichlet Allocation, the pseudo-document based Topic Model, and the Supervised Latent Dirichlet Allocation models. The first experiment demonstrates the limitations of the existing topic models when applied to web content data and, therefore, the need for a web content topic model. When applied to web data, overall performance was on average five times lower and, in some cases, up to approximately 20 times lower than when applied to conventional data. The second experiment evaluates the effectiveness of the HTM in discovering topics and term patterns in web content data. The HTM achieved an overall 35% improvement in topic coherence compared with LDA.
format Online
Article
Text
id pubmed-10403181
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-10403181 2023-08-05 Web content topic modeling using LDA and HTML tags Altarturi, Hamza H.M. Saadoon, Muntadher Anuar, Nor Badrul PeerJ Comput Sci Data Mining and Machine Learning PeerJ Inc. 2023-07-11 /pmc/articles/PMC10403181/ /pubmed/37547394 http://dx.doi.org/10.7717/peerj-cs.1459 Text en ©2023 Altarturi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title Web content topic modeling using LDA and HTML tags
topic Data Mining and Machine Learning
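
The description in this record names topic coherence as the standard metric for topic model quality but does not say which coherence measure the study used. As an illustration only, the sketch below computes one widely used variant, UMass coherence, for a single topic from document co-occurrence counts. The function name `umass_coherence`, the smoothing value, and the toy corpus are assumptions made for this example; they are not taken from the article and do not reproduce the authors' evaluation.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, eps=1.0):
    """UMass coherence of one topic (illustrative sketch, not the paper's code).

    top_words: the topic's top terms, ordered from most to least probable.
    documents: tokenized corpus, one list of tokens per document.
    eps: smoothing constant added to the co-occurrence count (assumed 1.0).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        # i < j, so top_words[i] is the higher-ranked word of the pair.
        w_hi, w_lo = top_words[i], top_words[j]
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # the word never occurs; skip to avoid dividing by zero
        score += math.log((doc_freq(w_hi, w_lo) + eps) / d_hi)
    return score


# Hypothetical toy corpus, for demonstration only.
docs = [
    ["topic", "model", "lda"],
    ["topic", "model", "html"],
    ["html", "tag", "web"],
    ["web", "page", "tag"],
]
coherent = umass_coherence(["topic", "model"], docs)
incoherent = umass_coherence(["topic", "tag"], docs)
print(coherent > incoherent)
```

Scores closer to zero indicate that a topic's top terms tend to appear in the same documents. On the toy corpus, "topic" and "model" always co-occur, so that pair scores higher than "topic" and "tag", which never share a document.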