
Modeling Topics in DFA-Based Lemmatized Gujarati Text

Topic modeling is a statistical, unsupervised machine learning technique that maps a high-dimensional corpus to a low-dimensional topical subspace, though its output leaves room for improvement. A topic model’s topic is expected to be interpretable as a concept, i.e., corr...


Bibliographic Details
Main Authors:	Chauhan, Uttam, Shah, Shrusti, Shiroya, Dharati, Solanki, Dipti, Patel, Zeel, Bhatia, Jitendra, Tanwar, Sudeep, Sharma, Ravi, Marina, Verdes, Raboaca, Maria Simona
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10007128/ https://www.ncbi.nlm.nih.gov/pubmed/36904915 http://dx.doi.org/10.3390/s23052708

_version_	1784905441831026688
author	Chauhan, Uttam; Shah, Shrusti; Shiroya, Dharati; Solanki, Dipti; Patel, Zeel; Bhatia, Jitendra; Tanwar, Sudeep; Sharma, Ravi; Marina, Verdes; Raboaca, Maria Simona
author_facet	Chauhan, Uttam; Shah, Shrusti; Shiroya, Dharati; Solanki, Dipti; Patel, Zeel; Bhatia, Jitendra; Tanwar, Sudeep; Sharma, Ravi; Marina, Verdes; Raboaca, Maria Simona
author_sort	Chauhan, Uttam
collection	PubMed
description	Topic modeling is a statistical, unsupervised machine learning technique that maps a high-dimensional corpus to a low-dimensional topical subspace, though its output leaves room for improvement. A topic model’s topic is expected to be interpretable as a concept, i.e., to correspond to human understanding of a topic occurring in texts. While discovering corpus themes, inference draws on the whole vocabulary, and the size of that vocabulary affects topic quality. The corpus contains many inflectional forms. Since words that frequently appear in the same sentence are likely to share a latent topic, practically all topic models rely on co-occurrence signals between terms in the corpus. In languages with extensive inflectional morphology, the abundance of distinct tokens weakens these signals and therefore the topics. Lemmatization is often used to preempt this problem. Gujarati is a morphologically rich language in which a word may have several inflectional forms. This paper proposes a deterministic finite automaton (DFA) based lemmatization technique for the Gujarati language that reduces inflected word forms to their root words. The set of topics is then inferred from this lemmatized corpus of Gujarati text. We employ statistical divergence measurements to identify semantically less coherent (overly general) topics. The results show that the lemmatized Gujarati corpus yields more interpretable and meaningful topics than the unlemmatized text. Finally, the results show that lemmatization reduces the vocabulary size by 16% and improves semantic coherence on all three measures (Log Conditional Probability, Pointwise Mutual Information, and Normalized Pointwise Mutual Information) from −9.39 to −7.49, −6.79 to −5.18, and −0.23 to −0.17, respectively.
format	Online Article Text
id	pubmed-10007128
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-10007128 2023-03-12 Modeling Topics in DFA-Based Lemmatized Gujarati Text. Chauhan, Uttam; Shah, Shrusti; Shiroya, Dharati; Solanki, Dipti; Patel, Zeel; Bhatia, Jitendra; Tanwar, Sudeep; Sharma, Ravi; Marina, Verdes; Raboaca, Maria Simona. Sensors (Basel). Article. MDPI 2023-03-01. /pmc/articles/PMC10007128/ /pubmed/36904915 http://dx.doi.org/10.3390/s23052708 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Chauhan, Uttam Shah, Shrusti Shiroya, Dharati Solanki, Dipti Patel, Zeel Bhatia, Jitendra Tanwar, Sudeep Sharma, Ravi Marina, Verdes Raboaca, Maria Simona Modeling Topics in DFA-Based Lemmatized Gujarati Text
title	Modeling Topics in DFA-Based Lemmatized Gujarati Text
title_full	Modeling Topics in DFA-Based Lemmatized Gujarati Text
title_fullStr	Modeling Topics in DFA-Based Lemmatized Gujarati Text
title_full_unstemmed	Modeling Topics in DFA-Based Lemmatized Gujarati Text
title_short	Modeling Topics in DFA-Based Lemmatized Gujarati Text
title_sort	modeling topics in dfa-based lemmatized gujarati text
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10007128/ https://www.ncbi.nlm.nih.gov/pubmed/36904915 http://dx.doi.org/10.3390/s23052708
work_keys_str_mv	AT chauhanuttam modelingtopicsindfabasedlemmatizedgujaratitext AT shahshrusti modelingtopicsindfabasedlemmatizedgujaratitext AT shiroyadharati modelingtopicsindfabasedlemmatizedgujaratitext AT solankidipti modelingtopicsindfabasedlemmatizedgujaratitext AT patelzeel modelingtopicsindfabasedlemmatizedgujaratitext AT bhatiajitendra modelingtopicsindfabasedlemmatizedgujaratitext AT tanwarsudeep modelingtopicsindfabasedlemmatizedgujaratitext AT sharmaravi modelingtopicsindfabasedlemmatizedgujaratitext AT marinaverdes modelingtopicsindfabasedlemmatizedgujaratitext AT raboacamariasimona modelingtopicsindfabasedlemmatizedgujaratitext
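
Note on the coherence measures: the description field above names three measures without defining them. As a rough illustration only, the sketch below computes them for a single word pair from document-level co-occurrence counts, using the commonly cited textbook definitions. The record does not state which variants the authors used, so the function names, the add-one smoothing, and the toy English corpus are all assumptions made for this example.

```python
import math
from itertools import combinations


def doc_frequencies(docs):
    """Count the documents containing each word and each unordered word pair."""
    single, pair = {}, {}
    for doc in docs:
        words = sorted(set(doc))
        for w in words:
            single[w] = single.get(w, 0) + 1
        for a, b in combinations(words, 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    return single, pair


def pair_scores(w1, w2, docs):
    """Return (log conditional probability, PMI, NPMI) for one word pair.

    Both words must occur somewhere in docs (hypothetical helper, not the
    authors' implementation).
    """
    single, pair = doc_frequencies(docs)
    n = len(docs)
    joint = pair.get(tuple(sorted((w1, w2))), 0)
    # Smoothed log conditional probability of w1 given w2 (add-one is assumed).
    lcp = math.log((joint + 1) / single[w2])
    if joint == 0:
        # The words never co-occur: PMI diverges and NPMI takes its minimum.
        return lcp, float("-inf"), -1.0
    p1, p2, p12 = single[w1] / n, single[w2] / n, joint / n
    pmi = math.log(p12 / (p1 * p2))
    # NPMI rescales PMI into [-1, 1]; the p12 == 1 case is set to 1.0 by assumption.
    npmi = pmi / -math.log(p12) if p12 < 1 else 1.0
    return lcp, pmi, npmi


# Toy corpus, invented for illustration; each document is a list of lemmas.
toy_docs = [
    ["farmer", "crop", "rain"],
    ["farmer", "crop"],
    ["rain", "flood"],
    ["cricket", "match"],
]
print(pair_scores("farmer", "crop", toy_docs))
print(pair_scores("farmer", "flood", toy_docs))
```

A topic-level score is usually obtained by averaging such pair scores over the top words of a topic. Under these definitions, values closer to zero (for the log-based measures) or closer to 1 (for NPMI) indicate words that co-occur more consistently, which matches the direction of the changes reported in the description.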
