
A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data

SIMPLE SUMMARY: Topic modeling was introduced to classify texts of natural language by inferring their topic structure from the frequency of words. This paper assumes that analogously the cancer subtype identity, which is crucial for the correct diagnosis and treatment plan, can be extracted from ge...


Bibliographic Details
Main Authors: Valle, Filippo, Osella, Matteo, Caselle, Michele
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7766023/
https://www.ncbi.nlm.nih.gov/pubmed/33339347
http://dx.doi.org/10.3390/cancers12123799
_version_ 1783628619962122240
author Valle, Filippo
Osella, Matteo
Caselle, Michele
author_facet Valle, Filippo
Osella, Matteo
Caselle, Michele
author_sort Valle, Filippo
collection PubMed
description SIMPLE SUMMARY: Topic modeling was introduced to classify texts of natural language by inferring their topic structure from the frequency of words. This paper assumes that, analogously, cancer subtype identity, which is crucial for correct diagnosis and treatment planning, can be extracted from gene expression patterns with similar techniques. Focusing on breast and lung cancer, we show that state-of-the-art topic modeling techniques can successfully classify known subtypes and identify cohorts of patients with different survival probabilities. The topic structure hidden in expression data can be viewed as a biologically relevant low-dimensional data representation that can be used to build efficient classifiers of expression patterns. ABSTRACT: Topic modeling is a widely used technique to extract relevant information from large arrays of data. The problem of finding a topic structure in a dataset was recently recognized to be analogous to the community detection problem in network theory. Leveraging this analogy, a new class of topic modeling strategies has been introduced to overcome some of the limitations of classical methods. This paper applies these recent ideas to TCGA transcriptomic data on breast and lung cancer. The established cancer subtype organization is well reconstructed in the inferred latent topic structure. Moreover, we identify specific topics that are enriched in genes known to play a role in the corresponding disease and are strongly related to the survival probability of patients. Finally, we show that a simple neural network classifier operating in the low-dimensional topic space can predict the cancer subtype of a test expression sample with high accuracy.
format Online
Article
Text
id pubmed-7766023
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7766023 2020-12-28 A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data Valle, Filippo Osella, Matteo Caselle, Michele Cancers (Basel) Article
MDPI 2020-12-16 /pmc/articles/PMC7766023/ /pubmed/33339347 http://dx.doi.org/10.3390/cancers12123799 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Valle, Filippo
Osella, Matteo
Caselle, Michele
A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data
title A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data
title_full A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data
title_fullStr A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data
title_full_unstemmed A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data
title_short A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data
title_sort topic modeling analysis of tcga breast and lung cancer transcriptomic data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7766023/
https://www.ncbi.nlm.nih.gov/pubmed/33339347
http://dx.doi.org/10.3390/cancers12123799
work_keys_str_mv AT vallefilippo atopicmodelinganalysisoftcgabreastandlungcancertranscriptomicdata
AT osellamatteo atopicmodelinganalysisoftcgabreastandlungcancertranscriptomicdata
AT casellemichele atopicmodelinganalysisoftcgabreastandlungcancertranscriptomicdata
AT vallefilippo topicmodelinganalysisoftcgabreastandlungcancertranscriptomicdata
AT osellamatteo topicmodelinganalysisoftcgabreastandlungcancertranscriptomicdata
AT casellemichele topicmodelinganalysisoftcgabreastandlungcancertranscriptomicdata