Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing

While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probab...

Bibliographic Details
Main authors: Zook, Justin M., Samarov, Daniel, McDaniel, Jennifer, Sen, Shurjo K., Salit, Marc
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3409179/
https://www.ncbi.nlm.nih.gov/pubmed/22859977
http://dx.doi.org/10.1371/journal.pone.0041356
author Zook, Justin M.
Samarov, Daniel
McDaniel, Jennifer
Sen, Shurjo K.
Salit, Marc
collection PubMed
description While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has fewer dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration.
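As a rough illustration of the idea in the description above, the following Python sketch shows how mismatches against a known spike-in sequence could be turned into empirical Phred-scaled quality scores, grouped by reported quality and dinucleotide context. It is not GATK's implementation and not the authors' code: the function names, the input tuple layout, and the +1/+2 smoothing are assumptions made only for this example. The one standard fact it relies on is the Phred definition, Q = -10 * log10(p_error).

```python
import math


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_quality(mismatches, total):
    """Empirical Phred quality from mismatch counts.

    The +1/+2 smoothing is an assumption of this sketch; it only
    keeps log10 away from zero when no mismatches were observed.
    """
    return phred((mismatches + 1) / (total + 2))


def recalibration_table(observations):
    """Group spike-in bases by (reported quality, dinucleotide).

    `observations` is a hypothetical iterable of tuples
    (reported_q, dinucleotide, is_mismatch), one per sequenced base
    that maps to a spike-in standard whose true sequence is known.
    Returns a dict mapping each group to its empirical quality.
    """
    counts = {}
    for reported_q, dinucleotide, is_mismatch in observations:
        key = (reported_q, dinucleotide)
        mismatches, total = counts.get(key, (0, 0))
        counts[key] = (mismatches + int(is_mismatch), total + 1)
    return {
        key: empirical_quality(mismatches, total)
        for key, (mismatches, total) in counts.items()
    }
```

A group whose empirical value falls well below its reported quality would correspond to an overestimated quality score of the kind the description reports for certain dinucleotides. Because the spike-in sequence is known in advance, any mismatch can be counted as a sequencing error without consulting a SNP database.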
format Online
Article
Text
id pubmed-3409179
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3409179 2012-08-02 PLoS One Research Article Public Library of Science 2012-07-31 /pmc/articles/PMC3409179/ /pubmed/22859977 http://dx.doi.org/10.1371/journal.pone.0041356 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose.
title Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing
topic Research Article