GC-Content Normalization for RNA-Seq Data

BACKGROUND: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensu...

Bibliographic Details
Main Authors: Risso, Davide; Schwartz, Katja; Sherlock, Gavin; Dudoit, Sandrine
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315510/
https://www.ncbi.nlm.nih.gov/pubmed/22177264
http://dx.doi.org/10.1186/1471-2105-12-480
_version_ 1782228245167472640
author Risso, Davide
Schwartz, Katja
Sherlock, Gavin
Dudoit, Sandrine
author_facet Risso, Davide
Schwartz, Katja
Sherlock, Gavin
Dudoit, Sandrine
author_sort Risso, Davide
collection PubMed
description BACKGROUND: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. RESULTS: We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. CONCLUSIONS: Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes.
format Online
Article
Text
id pubmed-3315510
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-33155102012-04-04 GC-Content Normalization for RNA-Seq Data Risso, Davide Schwartz, Katja Sherlock, Gavin Dudoit, Sandrine BMC Bioinformatics Research Article BACKGROUND: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. RESULTS: We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. CONCLUSIONS: Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes. BioMed Central 2011-12-17 /pmc/articles/PMC3315510/ /pubmed/22177264 http://dx.doi.org/10.1186/1471-2105-12-480 Text en Copyright ©2011 Risso et al; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Risso, Davide
Schwartz, Katja
Sherlock, Gavin
Dudoit, Sandrine
GC-Content Normalization for RNA-Seq Data
title GC-Content Normalization for RNA-Seq Data
title_full GC-Content Normalization for RNA-Seq Data
title_fullStr GC-Content Normalization for RNA-Seq Data
title_full_unstemmed GC-Content Normalization for RNA-Seq Data
title_short GC-Content Normalization for RNA-Seq Data
title_sort gc-content normalization for rna-seq data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315510/
https://www.ncbi.nlm.nih.gov/pubmed/22177264
http://dx.doi.org/10.1186/1471-2105-12-480
work_keys_str_mv AT rissodavide gccontentnormalizationforrnaseqdata
AT schwartzkatja gccontentnormalizationforrnaseqdata
AT sherlockgavin gccontentnormalizationforrnaseqdata
AT dudoitsandrine gccontentnormalizationforrnaseqdata
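
The description field above mentions within-lane, gene-level GC-content normalization but does not spell out the three approaches. As a rough illustration only, the sketch below shows one generic way such a correction could work: genes in a single lane are grouped into GC-content bins, and counts in each bin are rescaled so that every bin shares the same median. The function name `gc_bin_median_scale`, its parameters, and the toy data are hypothetical; this is not the authors' method and not the EDASeq API.

```python
from statistics import median


def gc_bin_median_scale(counts, gc, n_bins=10):
    """Rescale one lane's gene counts so each GC-content bin has the same median.

    counts: read counts per gene for a single lane (hypothetical input).
    gc:     GC fraction per gene, in the same order as counts.
    """
    if len(counts) != len(gc):
        raise ValueError("counts and gc must have the same length")

    normalized = [float(c) for c in counts]

    # Reference level: median of all nonzero counts in the lane.
    positive = [c for c in counts if c > 0]
    if not positive:
        return normalized
    target = median(positive)

    # Rank genes by GC content and cut them into roughly equal-sized bins.
    order = sorted(range(len(gc)), key=lambda i: gc[i])
    n_bins = max(1, min(n_bins, len(order)))
    bins = [
        order[k * len(order) // n_bins:(k + 1) * len(order) // n_bins]
        for k in range(n_bins)
    ]

    # Scale each bin so that its median nonzero count matches the target.
    for idx in bins:
        bin_positive = [counts[i] for i in idx if counts[i] > 0]
        if not bin_positive:
            continue  # nothing to scale in an all-zero bin
        factor = target / median(bin_positive)
        for i in idx:
            normalized[i] = counts[i] * factor
    return normalized


# Toy lane (made-up numbers) in which counts rise with GC content.
gc = [0.30, 0.32, 0.50, 0.52, 0.70, 0.72]
counts = [10, 20, 40, 80, 160, 320]
print(gc_bin_median_scale(counts, gc, n_bins=3))
```

In this toy lane the low-GC bin is scaled up and the high-GC bin is scaled down, so the GC trend in the bin medians disappears while the ordering of genes inside each bin is preserved. A between-lane step, which the abstract says follows the within-lane step, is not shown.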