Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias

Data normalization is a critical step in RNA sequencing (RNA-seq) analysis, aiming to remove systematic effects from the data to ensure that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a...

Bibliographic Details
Main authors: Mandelboum, Shir; Manber, Zohar; Elroy-Stein, Orna; Elkon, Ran
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6850523/
https://www.ncbi.nlm.nih.gov/pubmed/31714939
http://dx.doi.org/10.1371/journal.pbio.3000481
author Mandelboum, Shir
Manber, Zohar
Elroy-Stein, Orna
Elkon, Ran
collection PubMed
description Data normalization is a critical step in RNA sequencing (RNA-seq) analysis, aiming to remove systematic effects from the data to ensure that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. This stochastic sample-specific effect is not corrected by common normalization methods, including reads per kilobase of transcript length per million reads (RPKM), Trimmed Mean of M values (TMM), relative log expression (RLE), and quantile and upper-quartile normalization. Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent functional misinterpretation of the data. Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix genes) are particularly prone to such false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. Consequently, using these normalization methods led to substantial reduction in GSEA false results while retaining true ones. In addition, we found that application of gene-set tests that take into account gene–gene correlations attenuates false positive rates caused by the length bias, but statistical power is reduced as well. Our results advocate the inspection and correction of sample-specific length biases as default steps in RNA-seq analysis pipelines and reiterate the need to account for intergene correlations when performing gene-set enrichment tests to lessen false interpretation of transcriptomic data.
format Online
Article
Text
id pubmed-6850523
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6850523
2019-11-22
Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias
Mandelboum, Shir
Manber, Zohar
Elroy-Stein, Orna
Elkon, Ran
PLoS Biol
Meta-Research Article
Public Library of Science 2019-11-12
/pmc/articles/PMC6850523/
/pubmed/31714939
http://dx.doi.org/10.1371/journal.pbio.3000481
Text en
© 2019 Mandelboum et al
http://creativecommons.org/licenses/by/4.0/
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias
topic Meta-Research Article
work_keys_str_mv AT mandelboumshir recurrentfunctionalmisinterpretationofrnaseqdatacausedbysamplespecificgenelengthbias
AT manberzohar recurrentfunctionalmisinterpretationofrnaseqdatacausedbysamplespecificgenelengthbias
AT elroysteinorna recurrentfunctionalmisinterpretationofrnaseqdatacausedbysamplespecificgenelengthbias
AT elkonran recurrentfunctionalmisinterpretationofrnaseqdatacausedbysamplespecificgenelengthbias
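The description in this record advocates inspecting sample-specific length bias as a default step. One simple way to do that inspection is sketched below: compute the rank correlation between gene length and the log2 fold change between two samples. This is only an illustrative sketch, not the authors' procedure. The function names (`ranks`, `spearman`, `length_bias`), the pseudocount, and the toy numbers are all assumptions made for the example. The correction step itself (cqn and EDASeq are R/Bioconductor packages) is not shown.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def length_bias(lengths, counts_a, counts_b, pseudocount=1.0):
    """Spearman correlation between gene length and log2 fold change (b vs a).

    Counts are scaled by library size only. Dividing both samples by the
    same gene length (as RPKM does) would leave each ratio unchanged,
    because the length term cancels in a between-sample fold change.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    log_fc = [
        math.log2((b + pseudocount) / total_b)
        - math.log2((a + pseudocount) / total_a)
        for a, b in zip(counts_a, counts_b)
    ]
    return spearman(lengths, log_fc)


if __name__ == "__main__":
    # Hypothetical toy data: sample B over-represents longer genes.
    gene_lengths = [1000, 2000, 4000, 8000]
    sample_a = [100, 200, 400, 800]
    sample_b = [100, 300, 800, 2000]
    print(length_bias(gene_lengths, sample_a, sample_b))
```

A coefficient far from zero on real data would suggest that fold changes track gene length and that a length-aware normalization is worth considering. With a pseudocount and very low counts the coefficient can itself be distorted, so in practice genes with low counts would usually be filtered first.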