
How Many Genes Are Expressed in a Transcriptome? Estimation and Results for RNA-Seq

RNA-seq experiments estimate the number of genes expressed in a transcriptome as well as their relative frequencies. However, an undetermined number of genes can remain undetected due to their low expression relative to the sample size (sequence depth). Estimation of the true number of genes express...


Bibliographic Details
Main Authors: García-Ortega, Luis Fernando, Martínez, Octavio
Format: Online Article Text
Language: English
Published: Public Library of Science 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479379/
https://www.ncbi.nlm.nih.gov/pubmed/26107654
http://dx.doi.org/10.1371/journal.pone.0130262
_version_ 1782378000540499968
author García-Ortega, Luis Fernando
Martínez, Octavio
author_facet García-Ortega, Luis Fernando
Martínez, Octavio
author_sort García-Ortega, Luis Fernando
collection PubMed
description RNA-seq experiments estimate the number of genes expressed in a transcriptome as well as their relative frequencies. However, an undetermined number of genes can remain undetected due to their low expression relative to the sample size (sequence depth). Estimation of the true number of genes expressed in a transcriptome is essential in order to determine which genes are exclusively expressed in specific tissues or under particular conditions. A reliable estimate of the true number of expressed genes is also required to accurately measure transcriptome changes and to predict the sequencing depth needed to increase the proportion of detected genes. This problem is analogous to ecological sampling problems such as estimating the number of species at a given site. Here we present a non-parametric estimator for the number of undetected genes as well as for the extra sample size needed to detect a given proportion of the undetected genes. Our estimators are superior to ones already published by having smaller standard errors and biases. We applied our method to a set of 32 publicly available RNA-seq experiments, including the evaluation of 311 individually sequenced libraries. We found that in the majority of the cases more than one thousand genes are undetected, and that on average approximately 6% of the expressed genes per accession remain undetected. This figure increases to approximately 10% if individual sequencing libraries are analyzed. Our method is also applicable to metagenomic experiments. Using our method, the number of undetected genes as well as the sample size needed to detect them can be calculated, leading to more accurate and complete gene expression studies.
format Online
Article
Text
id pubmed-4479379
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-4479379 2015-06-29 How Many Genes Are Expressed in a Transcriptome? Estimation and Results for RNA-Seq García-Ortega, Luis Fernando Martínez, Octavio PLoS One Research Article
Public Library of Science 2015-06-24 /pmc/articles/PMC4479379/ /pubmed/26107654 http://dx.doi.org/10.1371/journal.pone.0130262 Text en © 2015 García-Ortega, Martínez http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
García-Ortega, Luis Fernando
Martínez, Octavio
How Many Genes Are Expressed in a Transcriptome? Estimation and Results for RNA-Seq
title How Many Genes Are Expressed in a Transcriptome? Estimation and Results for RNA-Seq
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479379/
https://www.ncbi.nlm.nih.gov/pubmed/26107654
http://dx.doi.org/10.1371/journal.pone.0130262