GC-Content Normalization for RNA-Seq Data
BACKGROUND: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensu...
Main authors: | Risso, Davide; Schwartz, Katja; Sherlock, Gavin; Dudoit, Sandrine |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2011 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315510/ https://www.ncbi.nlm.nih.gov/pubmed/22177264 http://dx.doi.org/10.1186/1471-2105-12-480 |
_version_ | 1782228245167472640 |
---|---|
author | Risso, Davide; Schwartz, Katja; Sherlock, Gavin; Dudoit, Sandrine |
author_facet | Risso, Davide; Schwartz, Katja; Sherlock, Gavin; Dudoit, Sandrine |
author_sort | Risso, Davide |
collection | PubMed |
description | BACKGROUND: Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. RESULTS: We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. CONCLUSIONS: Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes. |
format | Online Article Text |
id | pubmed-3315510 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3315510 2012-04-04 GC-Content Normalization for RNA-Seq Data Risso, Davide; Schwartz, Katja; Sherlock, Gavin; Dudoit, Sandrine BMC Bioinformatics Research Article BioMed Central 2011-12-17 /pmc/articles/PMC3315510/ /pubmed/22177264 http://dx.doi.org/10.1186/1471-2105-12-480 Text en Copyright ©2011 Risso et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Risso, Davide Schwartz, Katja Sherlock, Gavin Dudoit, Sandrine GC-Content Normalization for RNA-Seq Data |
title | GC-Content Normalization for RNA-Seq Data |
title_full | GC-Content Normalization for RNA-Seq Data |
title_fullStr | GC-Content Normalization for RNA-Seq Data |
title_full_unstemmed | GC-Content Normalization for RNA-Seq Data |
title_short | GC-Content Normalization for RNA-Seq Data |
title_sort | gc-content normalization for rna-seq data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3315510/ https://www.ncbi.nlm.nih.gov/pubmed/22177264 http://dx.doi.org/10.1186/1471-2105-12-480 |