Cargando…
Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing
While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probab...
Autores principales: | , , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Public Library of Science
2012
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3409179/ https://www.ncbi.nlm.nih.gov/pubmed/22859977 http://dx.doi.org/10.1371/journal.pone.0041356 |
_version_ | 1782239553406369792 |
---|---|
author | Zook, Justin M. Samarov, Daniel McDaniel, Jennifer Sen, Shurjo K. Salit, Marc |
author_facet | Zook, Justin M. Samarov, Daniel McDaniel, Jennifer Sen, Shurjo K. Salit, Marc |
author_sort | Zook, Justin M. |
collection | PubMed |
description | While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. |
format | Online Article Text |
id | pubmed-3409179 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-34091792012-08-02 Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing Zook, Justin M. Samarov, Daniel McDaniel, Jennifer Sen, Shurjo K. Salit, Marc PLoS One Research Article While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a data set used to calculate association of SSEs with various features in the reads and sequence context. This data set is typically either from a part of the data set being “recalibrated” (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 Phred-scaled quality score units, and by as much as 13 units at CpG sites. In addition, since the spike-in data used for recalibration are independent of the genome being sequenced, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Public Library of Science 2012-07-31 /pmc/articles/PMC3409179/ /pubmed/22859977 http://dx.doi.org/10.1371/journal.pone.0041356 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open-access article distributed under the terms of the Creative Commons Public Domain declaration, which stipulates that, once placed in the public domain, this work may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. |
spellingShingle | Research Article Zook, Justin M. Samarov, Daniel McDaniel, Jennifer Sen, Shurjo K. Salit, Marc Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing |
title | Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing |
title_full | Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing |
title_fullStr | Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing |
title_full_unstemmed | Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing |
title_short | Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing |
title_sort | synthetic spike-in standards improve run-specific systematic error analysis for dna and rna sequencing |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3409179/ https://www.ncbi.nlm.nih.gov/pubmed/22859977 http://dx.doi.org/10.1371/journal.pone.0041356 |
work_keys_str_mv | AT zookjustinm syntheticspikeinstandardsimproverunspecificsystematicerroranalysisfordnaandrnasequencing AT samarovdaniel syntheticspikeinstandardsimproverunspecificsystematicerroranalysisfordnaandrnasequencing AT mcdanieljennifer syntheticspikeinstandardsimproverunspecificsystematicerroranalysisfordnaandrnasequencing AT senshurjok syntheticspikeinstandardsimproverunspecificsystematicerroranalysisfordnaandrnasequencing AT salitmarc syntheticspikeinstandardsimproverunspecificsystematicerroranalysisfordnaandrnasequencing |