Comparative evaluation of gene set analysis approaches for RNA-Seq data

BACKGROUND: Over the last few years, transcriptome sequencing (RNA-Seq) has almost completely replaced microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes that are differentially expressed between two or more conditions. Despi...

Bibliographic Details
Main authors: Rahmatallah, Yasir; Emmert-Streib, Frank; Glazko, Galina
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4265362/
https://www.ncbi.nlm.nih.gov/pubmed/25475910
http://dx.doi.org/10.1186/s12859-014-0397-8
collection PubMed
description BACKGROUND: Over the last few years, transcriptome sequencing (RNA-Seq) has almost completely replaced microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes that are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in interpreting the results of RNA-Seq experiments, the limitations of GSA methods developed for microarrays in the context of RNA-Seq data are not well understood. RESULTS: We provide a thorough evaluation of popular multivariate and gene-level self-contained GSA approaches on simulated and real RNA-Seq data. The multivariate approach employs multivariate non-parametric tests combined with popular normalizations for RNA-Seq data. The gene-level approach utilizes univariate tests designed for the analysis of RNA-Seq data to find gene-specific P-values and combines them into a pathway P-value using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general, standard multivariate GSA tests detect pathways without bias in terms of pathway size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast, the Type I error rate and the power of gene-level GSA tests are heavily affected by the method used to combine P-values, and all of the aforementioned biases are present in the detected pathways. CONCLUSIONS: Our results emphasize the importance of using self-contained non-parametric multivariate tests for detecting differentially expressed pathways in RNA-Seq data and warn against applying gene-level GSA tests, especially because of their high Type I error rates for both simulated and real data.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-014-0397-8) contains supplementary material, which is available to authorized users.
id pubmed-4265362
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-4265362 2014-12-15
BMC Bioinformatics, Methodology Article. BioMed Central, 2014-12-05.
© Rahmatallah et al.; licensee BioMed Central Ltd. 2014. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
topic Methodology Article