Two-stage topic modelling of scientific publications: A case study of University of Nairobi, Kenya

Unsupervised statistical analysis of unstructured data has gained wide acceptance especially in natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal, bi...

Bibliographic Details
Main authors: Muchene, Leacky; Safari, Wende
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7790388/
https://www.ncbi.nlm.nih.gov/pubmed/33411774
http://dx.doi.org/10.1371/journal.pone.0243208
author Muchene, Leacky
Safari, Wende
collection PubMed
description Unsupervised statistical analysis of unstructured data has gained wide acceptance especially in natural language processing and text mining domains. Topic modelling with Latent Dirichlet Allocation is one such statistical tool that has been successfully applied to synthesize collections of legal, biomedical documents and journalistic topics. We applied a novel two-stage topic modelling approach and illustrated the methodology with data from a collection of published abstracts from the University of Nairobi, Kenya. In the first stage, topic modelling with Latent Dirichlet Allocation was applied to derive the per-document topic probabilities. To more succinctly present the topics, in the second stage, hierarchical clustering with Hellinger distance was applied to derive the final clusters of topics. The analysis showed that dominant research themes in the university include: HIV and malaria research, research on agricultural and veterinary services as well as cross-cutting themes in humanities and social sciences. Further, the use of hierarchical clustering in the second stage reduces the discovered latent topics to clusters of homogeneous topics.
format Online
Article
Text
id pubmed-7790388
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7790388 2021-01-27 Two-stage topic modelling of scientific publications: A case study of University of Nairobi, Kenya. Muchene, Leacky; Safari, Wende. PLoS One. Research Article. Public Library of Science 2021-01-07. /pmc/articles/PMC7790388/ /pubmed/33411774 http://dx.doi.org/10.1371/journal.pone.0243208 Text en. © 2021 Muchene, Safari. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Two-stage topic modelling of scientific publications: A case study of University of Nairobi, Kenya
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7790388/
https://www.ncbi.nlm.nih.gov/pubmed/33411774
http://dx.doi.org/10.1371/journal.pone.0243208
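
The second stage described in the record's abstract (hierarchical clustering of topics with Hellinger distance) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes each topic is already available as a discrete probability distribution (for example from a fitted Latent Dirichlet Allocation model), it assumes average linkage because the abstract does not name a linkage rule, and the function names `hellinger` and `cluster_topics` as well as the toy numbers are hypothetical.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    H(p, q) = sqrt( 0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2 ), which lies in [0, 1].
    """
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2.0
    )


def average_linkage(dist, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def cluster_topics(topics, n_clusters):
    """Agglomerative clustering of topic distributions (assumed average linkage).

    Starts with one cluster per topic and repeatedly merges the two closest
    clusters until only n_clusters remain. Returns lists of topic indices.
    """
    n = len(topics)
    dist = [[hellinger(topics[i], topics[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        # combinations yields index pairs with a < b, so deleting b is safe.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(dist, clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy input: four topics over a four-word vocabulary.
toy_topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.60, 0.30],
    [0.05, 0.05, 0.30, 0.60],
]
print(cluster_topics(toy_topics, 2))  # [[0, 1], [2, 3]]
```

For a real corpus, a library routine such as SciPy's hierarchical clustering would be the more practical choice; the pure-Python version is used here only to keep the sketch self-contained.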