Cargando…

Gene expression based survival prediction for cancer patients—A topic modeling approach

Cancer is one of the leading cause of death, worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping...

Descripción completa

Detalles Bibliográficos
Autores principales: Kumar, Luke, Greiner, Russell
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2019
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6857918/
https://www.ncbi.nlm.nih.gov/pubmed/31730620
http://dx.doi.org/10.1371/journal.pone.0224446
_version_ 1783470848906100736
author Kumar, Luke
Greiner, Russell
author_facet Kumar, Luke
Greiner, Russell
author_sort Kumar, Luke
collection PubMed
description Cancer is one of the leading cause of death, worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high-dimensionality of such gene expression data, many projects use some dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model—e.g., to accommodate the real-valued expression values—leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using this dLDA to learn these cancer-topics, we can then express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional “distribution vector” as input to a learning algorithm—here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using the r = 49,576 gene expression values, from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values—here using the mRNAseq modality—and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrates the generality, and the effectiveness, of this approach. The dLDA+MTLR source code is available at https://github.com/nitsanluke/GE-LDA-Survival.
format Online
Article
Text
id pubmed-6857918
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-68579182019-12-07 Gene expression based survival prediction for cancer patients—A topic modeling approach Kumar, Luke Greiner, Russell PLoS One Research Article Cancer is one of the leading cause of death, worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high-dimensionality of such gene expression data, many projects use some dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model—e.g., to accommodate the real-valued expression values—leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using this dLDA to learn these cancer-topics, we can then express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional “distribution vector” as input to a learning algorithm—here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using the r = 49,576 gene expression values, from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. We then validate this “dLDA+MTLR” approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values—here using the mRNAseq modality—and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent “D-calibrated” measure. These successes, in two different cancer types and expression modalities, demonstrates the generality, and the effectiveness, of this approach. The dLDA+MTLR source code is available at https://github.com/nitsanluke/GE-LDA-Survival. Public Library of Science 2019-11-15 /pmc/articles/PMC6857918/ /pubmed/31730620 http://dx.doi.org/10.1371/journal.pone.0224446 Text en © 2019 Kumar, Greiner http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Kumar, Luke
Greiner, Russell
Gene expression based survival prediction for cancer patients—A topic modeling approach
title Gene expression based survival prediction for cancer patients—A topic modeling approach
title_full Gene expression based survival prediction for cancer patients—A topic modeling approach
title_fullStr Gene expression based survival prediction for cancer patients—A topic modeling approach
title_full_unstemmed Gene expression based survival prediction for cancer patients—A topic modeling approach
title_short Gene expression based survival prediction for cancer patients—A topic modeling approach
title_sort gene expression based survival prediction for cancer patients—a topic modeling approach
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6857918/
https://www.ncbi.nlm.nih.gov/pubmed/31730620
http://dx.doi.org/10.1371/journal.pone.0224446
work_keys_str_mv AT kumarluke geneexpressionbasedsurvivalpredictionforcancerpatientsatopicmodelingapproach
AT greinerrussell geneexpressionbasedsurvivalpredictionforcancerpatientsatopicmodelingapproach