Multiomics Topic Modeling for Breast Cancer Classification

Bibliographic Details
Main authors: Valle, Filippo; Osella, Matteo; Caselle, Michele
Format: Online Article Text
Language: English
Published: MDPI 2022
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8909787/
https://www.ncbi.nlm.nih.gov/pubmed/35267458
http://dx.doi.org/10.3390/cancers14051150
Description
Summary: SIMPLE SUMMARY: Topic models are algorithms introduced for discovering hidden topics or latent variables in large, unstructured text corpora. Leveraging analogies between texts and gene expression profiles, these algorithms can be used to find structures in expression data. This work presents an application of topic modeling techniques for the identification of breast cancer subtypes. In particular, we extended a specific class of topic models to allow a multiomics approach. As an illustrative example, considering both messenger RNA and microRNA expression levels, we were able to clearly distinguish healthy from tumor samples as well as the different breast cancer subtypes. The integration of different layers of information is crucial for the observed classification accuracy. Our approach naturally provides the genes and the microRNAs associated with the specific topics that are used for sample organization. We show that these topics often contain genes involved in breast cancer development and are associated with different survival probabilities.
ABSTRACT: The integration of transcriptional data with other layers of information, such as the post-transcriptional regulation mediated by microRNAs, can be crucial to identify the driver genes and the subtypes of complex and heterogeneous diseases such as cancer. This paper presents an approach based on topic modeling to accomplish this integration task. More specifically, we show how an algorithm based on a hierarchical version of stochastic block modeling can be naturally extended to integrate any combination of 'omics data. We test this approach on breast cancer samples from the TCGA database, integrating data on messenger RNA, microRNAs, and copy number variations. We show that the inclusion of the microRNA layer significantly improves the accuracy of subtype classification. Moreover, some of the hidden structures or "topics" that the algorithm extracts actually correspond to genes and microRNAs involved in breast cancer development and are associated with the survival probability.