
Extracting information and inferences from a large text corpus

The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications.


Bibliographic Details
Main Authors: Avasthi, Sandhya, Chauhan, Ritu, Acharjya, Debi Prasanna
Format: Online Article Text
Language: English
Published: Springer Nature Singapore 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9676895/
https://www.ncbi.nlm.nih.gov/pubmed/36440061
http://dx.doi.org/10.1007/s41870-022-01123-4
_version_ 1784833692525395968
author Avasthi, Sandhya
Chauhan, Ritu
Acharjya, Debi Prasanna
author_facet Avasthi, Sandhya
Chauhan, Ritu
Acharjya, Debi Prasanna
author_sort Avasthi, Sandhya
collection PubMed
description The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications.
format Online
Article
Text
id pubmed-9676895
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer Nature Singapore
record_format MEDLINE/PubMed
spelling pubmed-9676895 2022-11-21 Extracting information and inferences from a large text corpus Avasthi, Sandhya; Chauhan, Ritu; Acharjya, Debi Prasanna. Int J Inf Technol, Original Research. Springer Nature Singapore 2022-11-20 2023 /pmc/articles/PMC9676895/ /pubmed/36440061 http://dx.doi.org/10.1007/s41870-022-01123-4 Text en © The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management 2022. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Original Research
Avasthi, Sandhya
Chauhan, Ritu
Acharjya, Debi Prasanna
Extracting information and inferences from a large text corpus
title Extracting information and inferences from a large text corpus
title_full Extracting information and inferences from a large text corpus
title_fullStr Extracting information and inferences from a large text corpus
title_full_unstemmed Extracting information and inferences from a large text corpus
title_short Extracting information and inferences from a large text corpus
title_sort extracting information and inferences from a large text corpus
topic Original Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9676895/
https://www.ncbi.nlm.nih.gov/pubmed/36440061
http://dx.doi.org/10.1007/s41870-022-01123-4
work_keys_str_mv AT avasthisandhya extractinginformationandinferencesfromalargetextcorpus
AT chauhanritu extractinginformationandinferencesfromalargetextcorpus
AT acharjyadebiprasanna extractinginformationandinferencesfromalargetextcorpus