Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size

RNA-Seq is a widely used technology that allows an efficient genome-wide quantification of gene expressions for, for example, differential expression (DE) analysis. After a brief review of the main issues, methods and tools related to the DE analysis of RNA-Seq data, this article focuses on the impa...

Bibliographic Details
Main Authors: Lamarre, Sophie, Frasse, Pierre, Zouine, Mohamed, Labourdette, Delphine, Sainderichin, Elise, Hu, Guojian, Le Berre-Anton, Véronique, Bouzayen, Mondher, Maza, Elie
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5817962/
https://www.ncbi.nlm.nih.gov/pubmed/29491871
http://dx.doi.org/10.3389/fpls.2018.00108
_version_ 1783300956630286336
author Lamarre, Sophie
Frasse, Pierre
Zouine, Mohamed
Labourdette, Delphine
Sainderichin, Elise
Hu, Guojian
Le Berre-Anton, Véronique
Bouzayen, Mondher
Maza, Elie
author_facet Lamarre, Sophie
Frasse, Pierre
Zouine, Mohamed
Labourdette, Delphine
Sainderichin, Elise
Hu, Guojian
Le Berre-Anton, Véronique
Bouzayen, Mondher
Maza, Elie
author_sort Lamarre, Sophie
collection PubMed
description RNA-Seq is a widely used technology that allows an efficient genome-wide quantification of gene expressions for, for example, differential expression (DE) analysis. After a brief review of the main issues, methods and tools related to the DE analysis of RNA-Seq data, this article focuses on the impact of both the replicate number and library size in such analyses. While the main drawback of previous relevant studies is the lack of generality, we conducted both an analysis of a two-condition experiment (with eight biological replicates per condition) to compare the results with previous benchmark studies, and a meta-analysis of 17 experiments with up to 18 biological conditions, eight biological replicates and 100 million (M) reads per sample. As a global trend, we concluded that the replicate number has a larger impact than the library size on the power of the DE analysis, except for low-expressed genes, for which both parameters seem to have the same impact. Our study also provides new insights for practitioners aiming to enhance their experimental designs. For instance, by analyzing both the sensitivity and specificity of the DE analysis, we showed that the optimal threshold to control the false discovery rate (FDR) is approximately 2^(−r), where r is the replicate number. Furthermore, we showed that the false positive rate (FPR) is rather well controlled by all three studied R packages: DESeq, DESeq2, and edgeR. We also analyzed the impact of both the replicate number and library size on gene ontology (GO) enrichment analysis. Interestingly, we concluded that increases in the replicate number and library size tend to enhance the sensitivity and specificity, respectively, of the GO analysis. Finally, we recommend to RNA-Seq practitioners the production of a pilot data set to strictly analyze the power of their experimental design, or the use of a public data set, which should be similar to the data set they will obtain. 
For individuals working on tomato research, on the basis of the meta-analysis, we recommend at least four biological replicates per condition and 20 M reads per sample to be almost sure of obtaining about 1000 DE genes if they exist.
format Online
Article
Text
id pubmed-5817962
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-5817962 2018-02-28 Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size Lamarre, Sophie Frasse, Pierre Zouine, Mohamed Labourdette, Delphine Sainderichin, Elise Hu, Guojian Le Berre-Anton, Véronique Bouzayen, Mondher Maza, Elie Front Plant Sci Plant Science RNA-Seq is a widely used technology that allows an efficient genome-wide quantification of gene expressions for, for example, differential expression (DE) analysis. After a brief review of the main issues, methods and tools related to the DE analysis of RNA-Seq data, this article focuses on the impact of both the replicate number and library size in such analyses. While the main drawback of previous relevant studies is the lack of generality, we conducted both an analysis of a two-condition experiment (with eight biological replicates per condition) to compare the results with previous benchmark studies, and a meta-analysis of 17 experiments with up to 18 biological conditions, eight biological replicates and 100 million (M) reads per sample. As a global trend, we concluded that the replicate number has a larger impact than the library size on the power of the DE analysis, except for low-expressed genes, for which both parameters seem to have the same impact. Our study also provides new insights for practitioners aiming to enhance their experimental designs. For instance, by analyzing both the sensitivity and specificity of the DE analysis, we showed that the optimal threshold to control the false discovery rate (FDR) is approximately 2^(−r), where r is the replicate number. Furthermore, we showed that the false positive rate (FPR) is rather well controlled by all three studied R packages: DESeq, DESeq2, and edgeR. We also analyzed the impact of both the replicate number and library size on gene ontology (GO) enrichment analysis. 
Interestingly, we concluded that increases in the replicate number and library size tend to enhance the sensitivity and specificity, respectively, of the GO analysis. Finally, we recommend to RNA-Seq practitioners the production of a pilot data set to strictly analyze the power of their experimental design, or the use of a public data set, which should be similar to the data set they will obtain. For individuals working on tomato research, on the basis of the meta-analysis, we recommend at least four biological replicates per condition and 20 M reads per sample to be almost sure of obtaining about 1000 DE genes if they exist. Frontiers Media S.A. 2018-02-14 /pmc/articles/PMC5817962/ /pubmed/29491871 http://dx.doi.org/10.3389/fpls.2018.00108 Text en Copyright © 2018 Lamarre, Frasse, Zouine, Labourdette, Sainderichin, Hu, Le Berre-Anton, Bouzayen and Maza. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Plant Science
Lamarre, Sophie
Frasse, Pierre
Zouine, Mohamed
Labourdette, Delphine
Sainderichin, Elise
Hu, Guojian
Le Berre-Anton, Véronique
Bouzayen, Mondher
Maza, Elie
Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size
title Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size
title_full Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size
title_fullStr Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size
title_full_unstemmed Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size
title_short Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size
title_sort optimization of an rna-seq differential gene expression analysis depending on biological replicate number and library size
topic Plant Science
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5817962/
https://www.ncbi.nlm.nih.gov/pubmed/29491871
http://dx.doi.org/10.3389/fpls.2018.00108
work_keys_str_mv AT lamarresophie optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT frassepierre optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT zouinemohamed optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT labourdettedelphine optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT sainderichinelise optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT huguojian optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT leberreantonveronique optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT bouzayenmondher optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize
AT mazaelie optimizationofanrnaseqdifferentialgeneexpressionanalysisdependingonbiologicalreplicatenumberandlibrarysize