Extracting information and inferences from a large text corpus
The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications.
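The abstract above compares probabilistic topic models such as LDA. As a purely illustrative aside (not the paper's ITMWE method or its actual experimental code), the core idea of LDA can be sketched with a toy collapsed Gibbs sampler: each word token is assigned a topic, and assignments are resampled from the full conditional built from document-topic and topic-word counts. All names and the tiny corpus below are hypothetical.

```python
import random

def lda_gibbs(docs, K, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word tokens.
    Returns (top words per topic, per-document topic counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}

    ndk = [[0] * K for _ in docs]       # doc-topic counts
    nkw = [[0] * V for _ in range(K)]   # topic-word counts
    nk = [0] * K                        # tokens per topic
    z = []                              # topic assignment per token

    # random initialization of topic assignments
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][wid[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], wid[w]
                # remove current assignment from the counts
                ndk[d][k] -= 1; nkw[k][v] -= 1; nk[k] -= 1
                # full conditional p(z = j | rest) up to a constant
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta)
                           / (nk[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for j, wgt in enumerate(weights):
                    acc += wgt
                    if r <= acc:
                        k = j
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][v] += 1; nk[k] += 1

    # report the three most frequent words of each topic
    topics = [[vocab[v] for v in
               sorted(range(V), key=lambda v: -nkw[k][v])[:3]]
              for k in range(K)]
    return topics, ndk
```

On a corpus with two clearly separated vocabularies (say, biomedical terms vs. Twitter terms), the sampler tends to concentrate each document's tokens on one topic, which is the behavior the paper's word-level comparison of LDA and DTM refers to.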
| Main Authors: | Avasthi, Sandhya; Chauhan, Ritu; Acharjya, Debi Prasanna |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | Springer Nature Singapore, 2022 |
| Subjects: | Original Research |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9676895/ https://www.ncbi.nlm.nih.gov/pubmed/36440061 http://dx.doi.org/10.1007/s41870-022-01123-4 |
_version_ | 1784833692525395968
author | Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna |
author_facet | Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna |
author_sort | Avasthi, Sandhya |
collection | PubMed |
description | The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications. |
format | Online Article Text |
id | pubmed-9676895 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer Nature Singapore |
record_format | MEDLINE/PubMed |
spelling | pubmed-96768952022-11-21 Extracting information and inferences from a large text corpus Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna Int J Inf Technol Original Research The usage of various software applications has grown tremendously due to the onset of Industry 4.0, giving rise to the accumulation of all forms of data. The scientific, biological, and social media text collections demand efficient machine learning methods for data interpretability, which organizations need in decision-making of all sorts. The topic models can be applied in text mining of biomedical articles, scientific articles, Twitter data, and blog posts. This paper analyzes and provides a comparison of the performance of Latent Dirichlet Allocation (LDA), Dynamic Topic Model (DTM), and Embedded Topic Model (ETM) techniques. An incremental topic model with word embedding (ITMWE) is proposed that processes large text data in an incremental environment and extracts latent topics that best describe the document collections. Experiments in both offline and online settings on large real-world document collections such as CORD-19, NIPS papers, and Tweet datasets show that, while LDA and DTM are good models for discovering word-level topics, ITMWE discovers better document-level topic groups more efficiently in a dynamic environment, which is crucial in text mining applications. Springer Nature Singapore 2022-11-20 2023 /pmc/articles/PMC9676895/ /pubmed/36440061 http://dx.doi.org/10.1007/s41870-022-01123-4 Text en © The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management 2022, Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Original Research Avasthi, Sandhya Chauhan, Ritu Acharjya, Debi Prasanna Extracting information and inferences from a large text corpus |
title | Extracting information and inferences from a large text corpus |
title_full | Extracting information and inferences from a large text corpus |
title_fullStr | Extracting information and inferences from a large text corpus |
title_full_unstemmed | Extracting information and inferences from a large text corpus |
title_short | Extracting information and inferences from a large text corpus |
title_sort | extracting information and inferences from a large text corpus |
topic | Original Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9676895/ https://www.ncbi.nlm.nih.gov/pubmed/36440061 http://dx.doi.org/10.1007/s41870-022-01123-4 |
work_keys_str_mv | AT avasthisandhya extractinginformationandinferencesfromalargetextcorpus AT chauhanritu extractinginformationandinferencesfromalargetextcorpus AT acharjyadebiprasanna extractinginformationandinferencesfromalargetextcorpus |