Using topic-noise models to generate domain-specific topics across data sources
Main Authors: Churchill, Rob; Singh, Lisa
Format: Online Article Text
Language: English
Published: Springer London, 2023
Subjects: Regular Paper
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9842404/ https://www.ncbi.nlm.nih.gov/pubmed/36683608 http://dx.doi.org/10.1007/s10115-022-01805-2
_version_ | 1784870117133254656 |
author | Churchill, Rob Singh, Lisa |
author_facet | Churchill, Rob Singh, Lisa |
author_sort | Churchill, Rob |
collection | PubMed |
description | Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional sources like open-ended surveys and newspaper articles to one of the dozens of online social media platforms. Most topic models are equipped to generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem that many models face is the varying noise levels inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and varying qualities. Our topic-noise model, Topic Noise Discriminator (TND), approximates topic and noise distributions side-by-side with the help of word embedding spaces. While topic-noise models are important for the types of short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. TND itself generates a noise distribution that, when ensembled with other generative topic models, can produce more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), and demonstrate the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics using multiple sources and finding that they need a way to identify a core set based on text from different sources. We propose using cross-source topic blending (CSTB), an approach that maps topic sets to an s-partite graph and identifies core topics that blend topics from across s sources by identifying subgraphs with certain linkage properties. We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources. (An illustrative sketch of the noise-filtering and cross-source blending ideas follows the record fields below.) |
format | Online Article Text |
id | pubmed-9842404 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Springer London |
record_format | MEDLINE/PubMed |
spelling | pubmed-9842404 2023-01-17 Using topic-noise models to generate domain-specific topics across data sources Churchill, Rob Singh, Lisa Knowl Inf Syst Regular Paper Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional sources like open-ended surveys and newspaper articles to one of the dozens of online social media platforms. Most topic models are equipped to generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem that many models face is the varying noise levels inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and varying qualities. Our topic-noise model, Topic Noise Discriminator (TND), approximates topic and noise distributions side-by-side with the help of word embedding spaces. While topic-noise models are important for the types of short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. TND itself generates a noise distribution that, when ensembled with other generative topic models, can produce more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), and demonstrate the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics using multiple sources and finding that they need a way to identify a core set based on text from different sources. We propose using cross-source topic blending (CSTB), an approach that maps topic sets to an s-partite graph and identifies core topics that blend topics from across s sources by identifying subgraphs with certain linkage properties. We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources. Springer London 2023-01-16 2023 /pmc/articles/PMC9842404/ /pubmed/36683608 http://dx.doi.org/10.1007/s10115-022-01805-2 Text en © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023, Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Regular Paper Churchill, Rob Singh, Lisa Using topic-noise models to generate domain-specific topics across data sources |
title | Using topic-noise models to generate domain-specific topics across data sources |
title_full | Using topic-noise models to generate domain-specific topics across data sources |
title_fullStr | Using topic-noise models to generate domain-specific topics across data sources |
title_full_unstemmed | Using topic-noise models to generate domain-specific topics across data sources |
title_short | Using topic-noise models to generate domain-specific topics across data sources |
title_sort | using topic-noise models to generate domain-specific topics across data sources |
topic | Regular Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9842404/ https://www.ncbi.nlm.nih.gov/pubmed/36683608 http://dx.doi.org/10.1007/s10115-022-01805-2 |
work_keys_str_mv | AT churchillrob usingtopicnoisemodelstogeneratedomainspecifictopicsacrossdatasources AT singhlisa usingtopicnoisemodelstogeneratedomainspecifictopicsacrossdatasources |
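The abstract above describes two ideas at a high level: ensembling the noise distribution learned by TND with a generative topic model such as LDA, and cross-source topic blending (CSTB) over an s-partite graph of topics. The Python sketch below is a minimal, hypothetical illustration of both under simple assumptions; the function names, the weight-comparison rule for noise filtering, and the word-overlap linkage rule for blending are stand-ins chosen for illustration, not the algorithms specified in the article.

```python
# Illustrative sketch only: simple stand-ins for the ideas named in the abstract,
# not the TND or CSTB implementations from the article.
from typing import Dict, List


def filter_topic_with_noise(topic_weights: Dict[str, float],
                            noise_weights: Dict[str, float],
                            top_k: int = 10) -> List[str]:
    """Keep a topic word only if its topic weight exceeds its (assumed) noise weight."""
    kept = [w for w, wt in topic_weights.items()
            if wt > noise_weights.get(w, 0.0)]
    return sorted(kept, key=lambda w: topic_weights[w], reverse=True)[:top_k]


def blend_core_topics(source_topics: Dict[str, List[List[str]]],
                      min_overlap: int = 2) -> List[List[str]]:
    """Link topics across sources when they share >= min_overlap words; a core topic
    here is one topic per source with all pairs linked (a clique in the s-partite
    topic graph). This linkage rule is an assumption made for illustration."""
    sources = list(source_topics)
    core: List[List[str]] = []

    # Brute-force over one topic per source (fine for a handful of small topic sets).
    def extend(partial: List[List[str]], remaining: List[str]) -> None:
        if not remaining:
            core.append(sorted(set(w for t in partial for w in t)))
            return
        for topic in source_topics[remaining[0]]:
            if all(len(set(topic) & set(t)) >= min_overlap for t in partial):
                extend(partial + [topic], remaining[1:])

    extend([], sources)
    return core


# Toy usage with made-up topics and weights:
topic = {"vaccine": 0.12, "covid": 0.10, "lol": 0.02, "dose": 0.05}
noise = {"lol": 0.08, "rt": 0.09, "covid": 0.01}
print(filter_topic_with_noise(topic, noise))   # ['vaccine', 'covid', 'dose']

topics_by_source = {
    "twitter":   [["vaccine", "dose", "covid"], ["mask", "mandate", "school"]],
    "newspaper": [["vaccine", "covid", "pfizer"], ["election", "vote", "ballot"]],
}
print(blend_core_topics(topics_by_source))     # [['covid', 'dose', 'pfizer', 'vaccine']]
```

In this toy setup, a word survives noise filtering only when its topic weight outweighs its assumed noise weight, and a core topic forms only when one topic from every source can be linked pairwise; both thresholds would need tuning on real data.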