Context-Aware Latent Dirichlet Allocation for Topic Segmentation
We propose a new generative model for topic segmentation based on Latent Dirichlet Allocation. The task is to divide a document into a sequence of topically coherent segments, while preserving long topic change-points (coherency) and keeping short topic segments from getting merged (saliency). Most...
Main authors: | Li, Wenbo, Matsukawa, Tetsu, Saigo, Hiroto, Suzuki, Einoshin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206242/ http://dx.doi.org/10.1007/978-3-030-47426-3_37 |
_version_ | 1783530375941718016 |
---|---|
author | Li, Wenbo Matsukawa, Tetsu Saigo, Hiroto Suzuki, Einoshin |
author_facet | Li, Wenbo Matsukawa, Tetsu Saigo, Hiroto Suzuki, Einoshin |
author_sort | Li, Wenbo |
collection | PubMed |
description | We propose a new generative model for topic segmentation based on Latent Dirichlet Allocation. The task is to divide a document into a sequence of topically coherent segments, while preserving long topic change-points (coherency) and keeping short topic segments from getting merged (saliency). Most of the existing models either fuse topic segments by keywords or focus on modeling word co-occurrence patterns without merging. They can hardly achieve both coherency and saliency since many words have high uncertainties in topic assignments due to their polysemous nature. To solve this problem, we introduce topic-specific co-occurrence of word pairs within contexts in modeling, to generate more coherent segments and alleviate the influence of irrelevant words on topic assignment. We also design an optimization algorithm to eliminate redundant items in the generated topic segments. Experimental results show that our proposal produces significant improvements in both topic coherence and topic segmentation. |
format | Online Article Text |
id | pubmed-7206242 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-7206242 2020-05-08 Context-Aware Latent Dirichlet Allocation for Topic Segmentation Li, Wenbo Matsukawa, Tetsu Saigo, Hiroto Suzuki, Einoshin Advances in Knowledge Discovery and Data Mining Article We propose a new generative model for topic segmentation based on Latent Dirichlet Allocation. The task is to divide a document into a sequence of topically coherent segments, while preserving long topic change-points (coherency) and keeping short topic segments from getting merged (saliency). Most of the existing models either fuse topic segments by keywords or focus on modeling word co-occurrence patterns without merging. They can hardly achieve both coherency and saliency since many words have high uncertainties in topic assignments due to their polysemous nature. To solve this problem, we introduce topic-specific co-occurrence of word pairs within contexts in modeling, to generate more coherent segments and alleviate the influence of irrelevant words on topic assignment. We also design an optimization algorithm to eliminate redundant items in the generated topic segments. Experimental results show that our proposal produces significant improvements in both topic coherence and topic segmentation. 2020-04-17 /pmc/articles/PMC7206242/ http://dx.doi.org/10.1007/978-3-030-47426-3_37 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Li, Wenbo Matsukawa, Tetsu Saigo, Hiroto Suzuki, Einoshin Context-Aware Latent Dirichlet Allocation for Topic Segmentation |
title | Context-Aware Latent Dirichlet Allocation for Topic Segmentation |
title_full | Context-Aware Latent Dirichlet Allocation for Topic Segmentation |
title_fullStr | Context-Aware Latent Dirichlet Allocation for Topic Segmentation |
title_full_unstemmed | Context-Aware Latent Dirichlet Allocation for Topic Segmentation |
title_short | Context-Aware Latent Dirichlet Allocation for Topic Segmentation |
title_sort | context-aware latent dirichlet allocation for topic segmentation |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206242/ http://dx.doi.org/10.1007/978-3-030-47426-3_37 |
work_keys_str_mv | AT liwenbo contextawarelatentdirichletallocationfortopicsegmentation AT matsukawatetsu contextawarelatentdirichletallocationfortopicsegmentation AT saigohiroto contextawarelatentdirichletallocationfortopicsegmentation AT suzukieinoshin contextawarelatentdirichletallocationfortopicsegmentation |