Extracting information and inferences from a large text corpus
The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications.
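The abstract above compares probabilistic topic models such as LDA. As a purely illustrative aside (not the paper's ITMWE method or its actual experimental code), the core idea of LDA can be sketched with a toy collapsed Gibbs sampler: each word token is assigned a topic, and assignments are resampled from the full conditional built from document-topic and topic-word counts. All names and the tiny corpus below are hypothetical.

```python
import random

def lda_gibbs(docs, K, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word tokens.
    Returns (top words per topic, per-document topic counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}

    ndk = [[0] * K for _ in docs]       # doc-topic counts
    nkw = [[0] * V for _ in range(K)]   # topic-word counts
    nk = [0] * K                        # tokens per topic
    z = []                              # topic assignment per token

    # random initialization of topic assignments
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][wid[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], wid[w]
                # remove current assignment from the counts
                ndk[d][k] -= 1; nkw[k][v] -= 1; nk[k] -= 1
                # full conditional p(z = j | rest) up to a constant
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for j, wgt in enumerate(weights):
                    acc += wgt
                    if r <= acc:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][v] += 1; nk[k] += 1

    # report the three most frequent words of each topic
    topics = [[vocab[v] for v in
               sorted(range(V), key=lambda v: -nkw[k][v])[:3]]
              for k in range(K)]
    return topics, ndk
```

On a corpus with two clearly separated vocabularies (say, biomedical terms vs. Twitter terms), the sampler tends to concentrate each document's tokens on one topic, which is the behavior the paper's word-level comparison of LDA and DTM refers to.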
| Main Authors: | Avasthi, Sandhya; Chauhan, Ritu; Acharjya, Debi Prasanna |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | Springer Nature Singapore, 2022 |
| Subjects: | Original Research |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9676895/ https://www.ncbi.nlm.nih.gov/pubmed/36440061 http://dx.doi.org/10.1007/s41870-022-01123-4 |
_version_ | 1784833692525395968
author | Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna |
author_facet | Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna |
author_sort | Avasthi, Sandhya |
collection | PubMed |
description | The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications. |
format | Online Article Text |
id | pubmed-9676895 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer Nature Singapore |
record_format | MEDLINE/PubMed |
spelling | pubmed-96768952022-11-21 Extracting information and inferences from a large text corpus Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna Int J Inf Technol Original Research The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications. Springer Nature Singapore 2022-11-20 2023 /pmc/articles/PMC9676895/ /pubmed/36440061 http://dx.doi.org/10.1007/s41870-022-01123-4 Text en © The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management 2022, Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Original Research Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna Extracting information and inferences from a large text corpus |
title | Extracting information and inferences from a large text corpus |
title_full | Extracting information and inferences from a large text corpus |
title_fullStr | Extracting information and inferences from a large text corpus |
title_full_unstemmed | Extracting information and inferences from a large text corpus |
title_short | Extracting information and inferences from a large text corpus |
title_sort | extracting information and inferences from a large text corpus |
topic | Original Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9676895/ https://www.ncbi.nlm.nih.gov/pubmed/36440061 http://dx.doi.org/10.1007/s41870-022-01123-4 |
work_keys_str_mv | AT avasthisandhya extractinginformationandinferencesfromalargetextcorpus AT chauhanritu extractinginformationandinferencesfromalargetextcorpus AT acharjyadebiprasanna extractinginformationandinferencesfromalargetextcorpus |