A network approach to topic models


Bibliographic Details
Main Authors:	Gerlach, Martin, Peixoto, Tiago P., Altmann, Eduardo G.
Format:	Online Article Text
Language:	English
Published:	American Association for the Advancement of Science 2018
Subjects:	Research Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6051742/ https://www.ncbi.nlm.nih.gov/pubmed/30035215 http://dx.doi.org/10.1126/sciadv.aaq1360

author	Gerlach, Martin Peixoto, Tiago P. Altmann, Eduardo G.
collection	PubMed
description	One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success—particularly of the most widely used variant called latent Dirichlet allocation (LDA)—and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, for example, a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. We obtain a fresh view of the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods (using a stochastic block model (SBM) with nonparametric priors), we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. Our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
format	Online Article Text
id	pubmed-6051742
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	American Association for the Advancement of Science
record_format	MEDLINE/PubMed
spelling	pubmed-6051742 2018-07-22 A network approach to topic models Gerlach, Martin Peixoto, Tiago P. Altmann, Eduardo G. Sci Adv Research Articles American Association for the Advancement of Science 2018-07-18 /pmc/articles/PMC6051742/ /pubmed/30035215 http://dx.doi.org/10.1126/sciadv.aaq1360 Text en Copyright © 2018 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC). http://creativecommons.org/licenses/by-nc/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial license (http://creativecommons.org/licenses/by-nc/4.0/), which permits use, distribution, and reproduction in any medium, so long as the resultant use is not for commercial advantage and provided the original work is properly cited.
title	A network approach to topic models
topic	Research Articles
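
The description above says that text corpora are represented as bipartite networks of documents and words. As a rough illustration of that representation only (not the authors' implementation, and without any stochastic block model inference), the sketch below builds a weighted edge list in plain Python. The function name `build_bipartite_edges`, the whitespace tokenization, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def build_bipartite_edges(docs):
    """Map (document index, word) pairs to the number of occurrences.

    Each key is one edge of the bipartite network: it links a document
    node to a word node, and the count is the edge multiplicity.
    Tokenization here is a naive lowercase whitespace split (an assumption).
    """
    edges = Counter()
    for doc_index, text in enumerate(docs):
        for word in text.lower().split():
            edges[(doc_index, word)] += 1
    return edges


# Hypothetical toy corpus, for illustration only.
docs = [
    "networks have communities",
    "topics have words",
    "networks networks topics",
]

edges = build_bipartite_edges(docs)

# The two node sets of the bipartite network.
doc_nodes = sorted({doc_index for doc_index, _ in edges})
word_nodes = sorted({word for _, word in edges})

for (doc_index, word), count in sorted(edges.items()):
    print(f"doc {doc_index} -- {word} (x{count})")
```

Every edge joins a document node to a word node and never two nodes of the same kind, which is what makes the network bipartite. A community-detection method such as the one the description mentions would then group the document nodes and the word nodes of a network like this one.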
