
Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data

BACKGROUND: In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed which is called read count bias (or gene length bias). This bias had great effec...


Bibliographic Details
Main Authors:	Yoon, Sora; Nam, Dougu
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445461/ https://www.ncbi.nlm.nih.gov/pubmed/28545404 http://dx.doi.org/10.1186/s12864-017-3809-0

_version_	1783238896347250688
author	Yoon, Sora Nam, Dougu
author_facet	Yoon, Sora Nam, Dougu
author_sort	Yoon, Sora
collection	PubMed
description	BACKGROUND: In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed which is called read count bias (or gene length bias). This bias had great effect on the downstream Gene Ontology over-representation analysis. However, such a bias has not been systematically analyzed for different replicate types of RNA-seq data. RESULTS: We show that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias) by mathematical inference and tests for a number of simulated and real RNA-seq datasets. We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some of genetically identical replicates such as cell lines or inbred animals), and many biological replicate data from unrelated samples do not suffer from such a bias except for genes with some small counts. It is also shown that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not. CONCLUSION: We showed the small gene variance (similarly, dispersion) is the main cause of read count bias (and gene length bias) for the first time and analyzed the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-3809-0) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5445461
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-54454612017-05-30 Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data Yoon, Sora Nam, Dougu BMC Genomics Research Article BACKGROUND: In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed which is called read count bias (or gene length bias). This bias had great effect on the downstream Gene Ontology over-representation analysis. However, such a bias has not been systematically analyzed for different replicate types of RNA-seq data. RESULTS: We show that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias) by mathematical inference and tests for a number of simulated and real RNA-seq datasets. We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some of genetically identical replicates such as cell lines or inbred animals), and many biological replicate data from unrelated samples do not suffer from such a bias except for genes with some small counts. It is also shown that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not. CONCLUSION: We showed the small gene variance (similarly, dispersion) is the main cause of read count bias (and gene length bias) for the first time and analyzed the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-017-3809-0) contains supplementary material, which is available to authorized users. 
BioMed Central 2017-05-25 /pmc/articles/PMC5445461/ /pubmed/28545404 http://dx.doi.org/10.1186/s12864-017-3809-0 Text en © The Author(s). 2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Yoon, Sora Nam, Dougu Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data
title	Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data
title_full	Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data
title_fullStr	Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data
title_full_unstemmed	Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data
title_short	Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data
title_sort	gene dispersion is the key determinant of the read count bias in differential expression analysis of rna-seq data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445461/ https://www.ncbi.nlm.nih.gov/pubmed/28545404 http://dx.doi.org/10.1186/s12864-017-3809-0
work_keys_str_mv	AT yoonsora genedispersionisthekeydeterminantofthereadcountbiasindifferentialexpressionanalysisofrnaseqdata AT namdougu genedispersionisthekeydeterminantofthereadcountbiasindifferentialexpressionanalysisofrnaseqdata
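
The abstract's central claim, that small dispersion drives the read count bias, can be illustrated with a short sketch. It assumes the common negative binomial parameterization Var = mu + phi * mu^2 (the record itself does not spell out the formula), under which the squared coefficient of variation is 1/mu + phi. The function name and the dispersion values 0.001 and 0.3 are hypothetical choices for illustration, not figures from the article.

```python
def nb_squared_cv(mean_count, dispersion):
    """Squared coefficient of variation of a negative binomial count.

    Assumes the parameterization Var = mu + phi * mu**2,
    so CV^2 = Var / mu**2 = 1/mu + phi.
    """
    if mean_count <= 0:
        raise ValueError("mean_count must be positive")
    if dispersion < 0:
        raise ValueError("dispersion must be non-negative")
    return 1.0 / mean_count + dispersion


def noise_ratio(low_count, high_count, dispersion):
    """How much noisier (in CV^2) a low-count gene is than a high-count gene."""
    return nb_squared_cv(low_count, dispersion) / nb_squared_cv(high_count, dispersion)


if __name__ == "__main__":
    # Illustrative dispersions only: a small value standing in for
    # technical replicates, a larger one for unrelated biological replicates.
    for phi in (0.001, 0.3):
        ratio = noise_ratio(10, 10000, phi)
        print(f"dispersion={phi}: CV^2 ratio (mean 10 vs mean 10000) = {ratio:.2f}")
```

With a small dispersion the 1/mu term dominates, so relative noise falls steeply as the mean count grows and high-count genes become much easier to call as differentially expressed. With a large dispersion the phi term dominates at almost every count level, so relative noise is nearly flat and the dependence on read count largely disappears, consistent with the pattern the abstract describes.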
