Using Supervised Learning Methods for Gene Selection in RNA-Seq Case-Control Studies

Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biologi...


Bibliographic Details
Main Authors: Wenric, Stephane; Shemirani, Ruhollah
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2018
Subjects: Genetics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6085558/
https://www.ncbi.nlm.nih.gov/pubmed/30123241
http://dx.doi.org/10.3389/fgene.2018.00297
author Wenric, Stephane
Shemirani, Ruhollah
author_facet Wenric, Stephane
Shemirani, Ruhollah
author_sort Wenric, Stephane
collection PubMed
description Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account the relationships between genes. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, which leaves a very wide range of candidates for follow-up study. Here, we explore the use of supervised learning methods to rank large sets of genes defined by their expression values measured with RNA-Seq in a typical two-class sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, which uses VAEs (variational autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq datasets ranging from 323 to 1,210 samples, a random forests-based gene selection method and the EPS pipeline outperform differential expression analysis for 9 and 8 of the 12 datasets, respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
format Online
Article
Text
id pubmed-6085558
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-6085558 2018-08-17 Using Supervised Learning Methods for Gene Selection in RNA-Seq Case-Control Studies Wenric, Stephane Shemirani, Ruhollah Front Genet Genetics Frontiers Media S.A. 2018-08-03 /pmc/articles/PMC6085558/ /pubmed/30123241 http://dx.doi.org/10.3389/fgene.2018.00297 Text en Copyright © 2018 Wenric and Shemirani. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Genetics
Wenric, Stephane
Shemirani, Ruhollah
Using Supervised Learning Methods for Gene Selection in RNA-Seq Case-Control Studies
title Using Supervised Learning Methods for Gene Selection in RNA-Seq Case-Control Studies
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6085558/
https://www.ncbi.nlm.nih.gov/pubmed/30123241
http://dx.doi.org/10.3389/fgene.2018.00297
work_keys_str_mv AT wenricstephane usingsupervisedlearningmethodsforgeneselectioninrnaseqcasecontrolstudies
AT shemiraniruhollah usingsupervisedlearningmethodsforgeneselectioninrnaseqcasecontrolstudies
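
The abstract in this record describes ranking genes by a variable importance measure from a random forests classifier. This record does not say which importance measure or which implementation the authors used, so the following is only a minimal sketch under stated assumptions: it swaps in a simplified forest of depth-1 trees (decision stumps) scored by mean decrease in Gini impurity, written in plain Python. The function names (`gini`, `best_stump`, `rank_genes`, `toy_data`) and the synthetic expression matrix are hypothetical and are not taken from the article.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_stump(X, y, rows, genes):
    """Return (gene, impurity decrease) for the best single split among `genes`."""
    parent = gini([y[i] for i in rows])
    best_gene, best_gain = None, 0.0
    for g in genes:
        values = sorted({X[i][g] for i in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [y[i] for i in rows if X[i][g] <= thr]
            right = [y[i] for i in rows if X[i][g] > thr]
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if parent - child > best_gain:
                best_gene, best_gain = g, parent - child
    return best_gene, best_gain


def rank_genes(X, y, n_trees=200, seed=0):
    """Rank genes by mean decrease in Gini impurity over a forest of stumps.

    Simplification (assumption): each tree has depth 1, is grown on a
    bootstrap sample, and considers a random subset of sqrt(n_genes) genes.
    """
    rng = random.Random(seed)
    n_samples, n_genes = len(X), len(X[0])
    m = max(1, int(n_genes ** 0.5))
    importance = [0.0] * n_genes
    for _ in range(n_trees):
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]  # bootstrap
        genes = rng.sample(range(n_genes), m)  # random gene subset
        gene, gain = best_stump(X, y, rows, genes)
        if gene is not None:
            importance[gene] += gain / n_trees
    ranking = sorted(range(n_genes), key=lambda g: importance[g], reverse=True)
    return ranking, importance


def toy_data(n_samples=40, n_genes=6, informative=2, seed=1):
    """Synthetic two-class data: only gene `informative` differs between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate controls (0) and cases (1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        row[informative] += 3.0 * label  # shift expression in cases
        X.append(row)
        y.append(label)
    return X, y


X, y = toy_data()
ranking, importance = rank_genes(X, y)
print("genes ranked by importance:", ranking)
```

On real RNA-Seq data one would more likely use an established random forest library with fully grown trees; the stump forest here only illustrates how an impurity-based importance score turns a two-class expression matrix into a gene ranking.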