
Trimming of sequence reads alters RNA-Seq gene expression estimates

BACKGROUND: High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is wide...

Bibliographic Details
Main Authors: Williams, Claire R., Baccarella, Alyssa, Parrish, Jay Z., Kim, Charles C.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4766705/
https://www.ncbi.nlm.nih.gov/pubmed/26911985
http://dx.doi.org/10.1186/s12859-016-0956-2
_version_ 1782417716038074368
author Williams, Claire R.
Baccarella, Alyssa
Parrish, Jay Z.
Kim, Charles C.
author_facet Williams, Claire R.
Baccarella, Alyssa
Parrish, Jay Z.
Kim, Charles C.
author_sort Williams, Claire R.
collection PubMed
description BACKGROUND: High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias. RESULTS: To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms—SolexaQA, Trimmomatic, and ConDeTri—to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. 
Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates. CONCLUSIONS: We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0956-2) contains supplementary material, which is available to authorized users.
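The abstract's recommendation, quality-based trimming followed by a minimum read length filter, can be illustrated with a minimal sketch. This is not the algorithm used by SolexaQA, Trimmomatic, or ConDeTri; it is a simple 3'-end trim under assumed conditions. The function names, the Phred threshold `min_q`, and the length cutoff `min_len` are hypothetical values chosen for illustration, and reads are assumed to be given as (sequence, list of Phred scores) pairs.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def trim_3prime(seq, quals, min_q=20):
    """Remove bases from the 3' end while their quality is below min_q.

    Illustrative only: real trimmers use more elaborate rules
    (sliding windows, longest high-quality segment, and so on).
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def trim_and_filter(reads, min_q=20, min_len=36):
    """Trim each read, then discard reads shorter than min_len.

    The length filter is the step the abstract associates with reducing
    spurious mapping of short reads. Both thresholds are assumptions.
    """
    kept = []
    for seq, quals in reads:
        trimmed_seq, trimmed_quals = trim_3prime(seq, quals, min_q)
        if len(trimmed_seq) >= min_len:
            kept.append((trimmed_seq, trimmed_quals))
    return kept


if __name__ == "__main__":
    # Toy reads with made-up quality scores.
    reads = [
        ("ACGTACGT", [30, 30, 30, 30, 30, 30, 10, 5]),
        ("ACGT", [30, 30, 2, 2]),
    ]
    for seq, quals in trim_and_filter(reads, min_q=20, min_len=4):
        print(seq, quals)
```

In this toy input the second read is shortened to two bases by trimming and is then dropped by the length filter, rather than being passed on to alignment as a very short read.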
format Online
Article
Text
id pubmed-4766705
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-47667052016-02-26 Trimming of sequence reads alters RNA-Seq gene expression estimates Williams, Claire R. Baccarella, Alyssa Parrish, Jay Z. Kim, Charles C. BMC Bioinformatics Research Article BACKGROUND: High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias. RESULTS: To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms—SolexaQA, Trimmomatic, and ConDeTri—to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. 
Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates. CONCLUSIONS: We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0956-2) contains supplementary material, which is available to authorized users. BioMed Central 2016-02-25 /pmc/articles/PMC4766705/ /pubmed/26911985 http://dx.doi.org/10.1186/s12859-016-0956-2 Text en © Williams et al. 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Williams, Claire R.
Baccarella, Alyssa
Parrish, Jay Z.
Kim, Charles C.
Trimming of sequence reads alters RNA-Seq gene expression estimates
title Trimming of sequence reads alters RNA-Seq gene expression estimates
title_full Trimming of sequence reads alters RNA-Seq gene expression estimates
title_fullStr Trimming of sequence reads alters RNA-Seq gene expression estimates
title_full_unstemmed Trimming of sequence reads alters RNA-Seq gene expression estimates
title_short Trimming of sequence reads alters RNA-Seq gene expression estimates
title_sort trimming of sequence reads alters rna-seq gene expression estimates
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4766705/
https://www.ncbi.nlm.nih.gov/pubmed/26911985
http://dx.doi.org/10.1186/s12859-016-0956-2
work_keys_str_mv AT williamsclairer trimmingofsequencereadsaltersrnaseqgeneexpressionestimates
AT baccarellaalyssa trimmingofsequencereadsaltersrnaseqgeneexpressionestimates
AT parrishjayz trimmingofsequencereadsaltersrnaseqgeneexpressionestimates
AT kimcharlesc trimmingofsequencereadsaltersrnaseqgeneexpressionestimates