Bias detection and correction in RNA-Sequencing data

Bibliographic Details
Main Authors: Zheng, Wei; Chung, Lisa M; Zhao, Hongyu
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149584/
https://www.ncbi.nlm.nih.gov/pubmed/21771300
http://dx.doi.org/10.1186/1471-2105-12-290
_version_ 1782209466745225216
author Zheng, Wei
Chung, Lisa M
Zhao, Hongyu
author_facet Zheng, Wei
Chung, Lisa M
Zhao, Hongyu
author_sort Zheng, Wei
collection PubMed
description BACKGROUND: High-throughput sequencing technology provides unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and the ability to identify novel transcripts. Moreover, for genes with multiple isoforms, expression of each isoform may be estimated from RNA-Seq data. Despite these advantages, recent work has revealed that base-level read counts from RNA-Seq data may not be randomly distributed and can be affected by local nucleotide composition. It was unclear, however, how this base-level read count bias affects gene-level expression estimates. RESULTS: Using five published RNA-Seq data sets from different biological sources and with different data preprocessing schemes, we showed that commonly used estimates of gene expression levels from RNA-Seq data, such as reads per kilobase of gene length per million reads (RPKM), are biased with respect to gene length, GC content and dinucleotide frequencies. We directly examined the biases at the gene level and proposed a simple approach, based on a generalized additive model, to correct different sources of bias simultaneously. Compared to previously proposed base-level correction methods, our method reduces bias in gene-level expression estimates more effectively. CONCLUSIONS: Our method identifies and corrects different sources of bias in gene-level expression measures from RNA-Seq data, and provides more accurate estimates of gene expression levels from RNA-Seq. This method should prove useful in meta-analysis of gene expression levels using different platforms or experimental protocols.
format Online
Article
Text
id pubmed-3149584
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3149584 2011-08-04 Bias detection and correction in RNA-Sequencing data Zheng, Wei Chung, Lisa M Zhao, Hongyu BMC Bioinformatics Research Article
BioMed Central 2011-07-19 /pmc/articles/PMC3149584/ /pubmed/21771300 http://dx.doi.org/10.1186/1471-2105-12-290 Text en Copyright ©2011 Zheng et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Zheng, Wei
Chung, Lisa M
Zhao, Hongyu
Bias detection and correction in RNA-Sequencing data
title Bias detection and correction in RNA-Sequencing data
title_full Bias detection and correction in RNA-Sequencing data
title_fullStr Bias detection and correction in RNA-Sequencing data
title_full_unstemmed Bias detection and correction in RNA-Sequencing data
title_short Bias detection and correction in RNA-Sequencing data
title_sort bias detection and correction in rna-sequencing data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149584/
https://www.ncbi.nlm.nih.gov/pubmed/21771300
http://dx.doi.org/10.1186/1471-2105-12-290
work_keys_str_mv AT zhengwei biasdetectionandcorrectioninrnasequencingdata
AT chunglisam biasdetectionandcorrectioninrnasequencingdata
AT zhaohongyu biasdetectionandcorrectioninrnasequencingdata