Finding the active genes in deep RNA-seq gene expression studies

Bibliographic Details
Main authors: Hart, Traver; Komori, H Kiyomi; LaMere, Sarah; Podshivalova, Katie; Salomon, Daniel R
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3870982/
https://www.ncbi.nlm.nih.gov/pubmed/24215113
http://dx.doi.org/10.1186/1471-2164-14-778
author Hart, Traver
Komori, H Kiyomi
LaMere, Sarah
Podshivalova, Katie
Salomon, Daniel R
collection PubMed
description BACKGROUND: Early application of second-generation sequencing technologies to transcript quantitation (RNA-seq) has hinted at a vast mammalian transcriptome, including transcripts from nearly all known genes, which might be fully measured only by ultradeep sequencing. Subsequent studies suggested that low-abundance transcripts might be the result of technical or biological noise rather than active transcripts; moreover, most RNA-seq experiments did not provide enough read depth to generate high-confidence estimates of gene expression for low-abundance transcripts. As a result, the community adopted several heuristics for RNA-seq analysis, most notably an arbitrary expression threshold of 0.3 - 1 FPKM for downstream analysis. However, advances in RNA-seq library preparation, sequencing technology, and informatic analysis have addressed many of the systemic sources of uncertainty and undermined the assumptions that drove the adoption of these heuristics. We provide an updated view of the accuracy and efficiency of RNA-seq experiments, using genomic data from large-scale studies like the ENCODE project to provide orthogonal information against which to validate our conclusions.
RESULTS: We show that a human cell’s transcriptome can be divided into active genes carrying out the work of the cell and other genes that are likely the by-products of biological or experimental noise. We use ENCODE data on chromatin state to show that ultralow-expression genes are predominantly associated with repressed chromatin; we provide a novel normalization metric, zFPKM, that identifies the threshold between active and background gene expression; and we show that this threshold is robust to experimental and analytical variations.
CONCLUSIONS: The zFPKM normalization method accurately separates the biologically relevant genes in a cell, which are associated with active promoters, from the ultralow-expression noisy genes that have repressed promoters. A read depth of twenty to thirty million mapped reads allows high-confidence quantitation of genes expressed at this threshold, providing important guidance for the design of RNA-seq studies of gene expression. Moreover, we offer an example for using extensive ENCODE chromatin state information to validate RNA-seq analysis pipelines.
format Online
Article
Text
id pubmed-3870982
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3870982 2013-12-27 Finding the active genes in deep RNA-seq gene expression studies Hart, Traver Komori, H Kiyomi LaMere, Sarah Podshivalova, Katie Salomon, Daniel R BMC Genomics Methodology Article BioMed Central 2013-11-11 /pmc/articles/PMC3870982/ /pubmed/24215113 http://dx.doi.org/10.1186/1471-2164-14-778 Text en Copyright © 2013 Hart et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Finding the active genes in deep RNA-seq gene expression studies
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3870982/
https://www.ncbi.nlm.nih.gov/pubmed/24215113
http://dx.doi.org/10.1186/1471-2164-14-778