Variability in estimated gene expression among commonly used RNA-seq pipelines

RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets using numerical methods such as clustering, classification, regression, and differential expression analysis. Such approaches rely on the assumption that mRNA abundance estimates from RNA-seq are reliable estim...

Bibliographic Details
Main Authors: Arora, Sonali; Pattwell, Siobhan S.; Holland, Eric C.; Bolouri, Hamid
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7026138/
https://www.ncbi.nlm.nih.gov/pubmed/32066774
http://dx.doi.org/10.1038/s41598-020-59516-z
author Arora, Sonali
Pattwell, Siobhan S.
Holland, Eric C.
Bolouri, Hamid
collection PubMed
description RNA-sequencing data is widely used to identify disease biomarkers and therapeutic targets using numerical methods such as clustering, classification, regression, and differential expression analysis. Such approaches rely on the assumption that mRNA abundance estimates from RNA-seq are reliable estimates of true expression levels. Here, using data from five RNA-seq processing pipelines applied to 6,690 human tumor and normal tissues, we show that nearly 88% of protein-coding genes have similar gene expression profiles across all pipelines. However, for >12% of protein-coding genes, current best-in-class RNA-seq processing pipelines differ in their abundance estimates by more than four-fold when applied to exactly the same samples and the same set of RNA-seq reads. Expression fold changes are similarly affected. Many of the impacted genes are widely studied disease-associated genes. We show that impacted genes exhibit diverse patterns of discordance among pipelines, suggesting that many inter-pipeline differences contribute to overall uncertainty in mRNA abundance estimates. A concerted, community-wide effort will be needed to develop gold-standards for estimating the mRNA abundance of the discordant genes reported here. In the meantime, our list of discordantly evaluated genes provides an important resource for robust marker discovery and target selection.
format Online
Article
Text
id pubmed-7026138
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-7026138 2020-02-26 Variability in estimated gene expression among commonly used RNA-seq pipelines Arora, Sonali Pattwell, Siobhan S. Holland, Eric C. Bolouri, Hamid Sci Rep Article
Nature Publishing Group UK 2020-02-17 /pmc/articles/PMC7026138/ /pubmed/32066774 http://dx.doi.org/10.1038/s41598-020-59516-z Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Variability in estimated gene expression among commonly used RNA-seq pipelines
topic Article