Comparative evaluation of gene set analysis approaches for RNA-Seq data
BACKGROUND: Over the last few years transcriptome sequencing (RNA-Seq) has almost completely taken over microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes which are differentially expressed between two or more conditions. Despi...
Main Authors: | Rahmatallah, Yasir; Emmert-Streib, Frank; Glazko, Galina |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4265362/ https://www.ncbi.nlm.nih.gov/pubmed/25475910 http://dx.doi.org/10.1186/s12859-014-0397-8 |
_version_ | 1782348874378117120 |
---|---|
author | Rahmatallah, Yasir Emmert-Streib, Frank Glazko, Galina |
author_facet | Rahmatallah, Yasir Emmert-Streib, Frank Glazko, Galina |
author_sort | Rahmatallah, Yasir |
collection | PubMed |
description | BACKGROUND: Over the last few years transcriptome sequencing (RNA-Seq) has almost completely taken over from microarrays for high-throughput studies of gene expression. Currently, the most popular use of RNA-Seq is to identify genes which are differentially expressed between two or more conditions. Despite the importance of Gene Set Analysis (GSA) in the interpretation of the results from RNA-Seq experiments, the limitations of GSA methods developed for microarrays in the context of RNA-Seq data are not well understood. RESULTS: We provide a thorough evaluation of popular multivariate and gene-level self-contained GSA approaches on simulated and real RNA-Seq data. The multivariate approach employs multivariate non-parametric tests combined with popular normalizations for RNA-Seq data. The gene-level approach utilizes univariate tests designed for the analysis of RNA-Seq data to find gene-specific P-values and combines them into a pathway P-value using classical statistical techniques. Our results demonstrate that the Type I error rate and the power of multivariate tests depend only on the test statistics and are insensitive to the different normalizations. In general, standard multivariate GSA tests detect pathways that do not have any bias in terms of pathway size, percentage of differentially expressed genes, or average gene length in a pathway. In contrast, the Type I error rate and the power of gene-level GSA tests are heavily affected by the methods for combining P-values, and all aforementioned biases are present in detected pathways. CONCLUSIONS: Our results emphasize the importance of using self-contained non-parametric multivariate tests for detecting differentially expressed pathways for RNA-Seq data and warn against applying gene-level GSA tests, especially because of their high Type I error rates for both simulated and real data.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-014-0397-8) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4265362 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4265362 2014-12-15 Comparative evaluation of gene set analysis approaches for RNA-Seq data Rahmatallah, Yasir Emmert-Streib, Frank Glazko, Galina BMC Bioinformatics Methodology Article BioMed Central 2014-12-05 /pmc/articles/PMC4265362/ /pubmed/25475910 http://dx.doi.org/10.1186/s12859-014-0397-8 Text en © Rahmatallah et al.; licensee BioMed Central Ltd. 2014 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article Rahmatallah, Yasir Emmert-Streib, Frank Glazko, Galina Comparative evaluation of gene set analysis approaches for RNA-Seq data |
title | Comparative evaluation of gene set analysis approaches for RNA-Seq data |
title_full | Comparative evaluation of gene set analysis approaches for RNA-Seq data |
title_fullStr | Comparative evaluation of gene set analysis approaches for RNA-Seq data |
title_full_unstemmed | Comparative evaluation of gene set analysis approaches for RNA-Seq data |
title_short | Comparative evaluation of gene set analysis approaches for RNA-Seq data |
title_sort | comparative evaluation of gene set analysis approaches for rna-seq data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4265362/ https://www.ncbi.nlm.nih.gov/pubmed/25475910 http://dx.doi.org/10.1186/s12859-014-0397-8 |
work_keys_str_mv | AT rahmatallahyasir comparativeevaluationofgenesetanalysisapproachesforrnaseqdata AT emmertstreibfrank comparativeevaluationofgenesetanalysisapproachesforrnaseqdata AT glazkogalina comparativeevaluationofgenesetanalysisapproachesforrnaseqdata |
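
The description in this record says the gene-level approach combines gene-specific P-values into a pathway P-value "using classical statistical techniques" but does not name them here. As an illustration only, the sketch below uses Fisher's method, which is one classical choice and is an assumption on our part, not necessarily what the article evaluated. The function name `fisher_combined_pvalue` and the example P-values are hypothetical. Fisher's method assumes the P-values are independent, which genes within a pathway generally are not, so the result should be read with that caveat.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine gene-level P-values into one set-level P-value (Fisher's method).

    Assumes the input P-values are independent. The statistic
    X = -2 * sum(ln p_i) follows a chi-square distribution with 2k
    degrees of freedom under the null, where k is the number of genes.
    """
    if not pvalues:
        raise ValueError("need at least one P-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")

    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # X / 2

    # For even degrees of freedom 2k the chi-square survival function has
    # a closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical gene-level P-values for a four-gene pathway.
    gene_pvalues = [0.01, 0.20, 0.03, 0.50]
    print(fisher_combined_pvalue(gene_pvalues))
```

The closed-form survival function keeps the sketch free of third-party dependencies. With a single gene the combined value reduces to that gene's own P-value, which is a quick sanity check on the formula.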