Estimation of sequencing error rates in short reads

BACKGROUND: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual e...

Bibliographic Details
Main Authors: Victoria, Xin; Blades, Natalie; Ding, Jie; Sultana, Razvan; Parmigiani, Giovanni
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3495688/
https://www.ncbi.nlm.nih.gov/pubmed/22846331
http://dx.doi.org/10.1186/1471-2105-13-185
_version_ 1782249550399930368
author Victoria, Xin
Blades, Natalie
Ding, Jie
Sultana, Razvan
Parmigiani, Giovanni
author_facet Victoria, Xin
Blades, Natalie
Ding, Jie
Sultana, Razvan
Parmigiani, Giovanni
author_sort Victoria, Xin
collection PubMed
description BACKGROUND: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. RESULTS: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html. CONCLUSIONS: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
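The relationship described in the abstract can be illustrated with a minimal Python sketch. This is not the shadowRegression R package and does not claim to reproduce the authors' implementation. The toy reads, the `min_copies` threshold, the use of ordinary least squares, and the two transforms are all assumptions made for illustration: if an abundant read has n error-free copies and about b*n erroneous neighbours, then roughly b/(1+b) of the reads from that source carry an error, and a per-base rate follows only if errors are assumed independent across positions.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def shadow_count(counts, seq, max_mismatches=2):
    """Total copies of reads that differ from seq by 1..max_mismatches bases."""
    return sum(
        n for other, n in counts.items()
        if len(other) == len(seq) and 1 <= hamming(seq, other) <= max_mismatches
    )

def least_squares_slope(xs, ys):
    """Slope of an ordinary least-squares line (with intercept) of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def read_error_rate(slope):
    """Assumed transform: erroneous copies / (error-free + erroneous copies)."""
    return slope / (1.0 + slope)

def per_base_error_rate(read_rate, read_length):
    """Assumed transform: independent, identical error chance at each base."""
    return 1.0 - (1.0 - read_rate) ** (1.0 / read_length)

# Hypothetical toy data: three abundant reads, each with a few erroneous copies.
reads = (
    ["AAAAAAAA"] * 100 + ["AAAAAAAT"] * 10
    + ["CCCCCCCC"] * 200 + ["CCCCCCTT"] * 20
    + ["GGGGGGGG"] * 400 + ["GGGGGGGT"] * 40
)
counts = Counter(reads)

min_copies = 100  # hypothetical cutoff for treating a read as error-free
parents = [seq for seq, n in counts.items() if n >= min_copies]

copy_numbers = [counts[p] for p in parents]
shadows = [shadow_count(counts, p) for p in parents]

slope = least_squares_slope(copy_numbers, shadows)
read_rate = read_error_rate(slope)
base_rate = per_base_error_rate(read_rate, read_length=8)

print("slope:", slope)
print("per-read error rate:", read_rate)
print("per-base error rate:", base_rate)
```

No reference genome appears anywhere in the sketch: the counts of abundant reads and of their near-identical neighbours are the only inputs. On real data the quadratic neighbour search would need an index, and a robust fit may be preferable to plain least squares.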
format Online
Article
Text
id pubmed-3495688
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-34956882012-11-19 Estimation of sequencing error rates in short reads Victoria, Xin Blades, Natalie Ding, Jie Sultana, Razvan Parmigiani, Giovanni BMC Bioinformatics Methodology Article BACKGROUND: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. RESULTS: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html. CONCLUSIONS: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
BioMed Central 2012-07-30 /pmc/articles/PMC3495688/ /pubmed/22846331 http://dx.doi.org/10.1186/1471-2105-13-185 Text en Copyright ©2012 Wang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Victoria, Xin
Blades, Natalie
Ding, Jie
Sultana, Razvan
Parmigiani, Giovanni
Estimation of sequencing error rates in short reads
title Estimation of sequencing error rates in short reads
title_full Estimation of sequencing error rates in short reads
title_fullStr Estimation of sequencing error rates in short reads
title_full_unstemmed Estimation of sequencing error rates in short reads
title_short Estimation of sequencing error rates in short reads
title_sort estimation of sequencing error rates in short reads
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3495688/
https://www.ncbi.nlm.nih.gov/pubmed/22846331
http://dx.doi.org/10.1186/1471-2105-13-185
work_keys_str_mv AT victoriaxin estimationofsequencingerrorratesinshortreads
AT bladesnatalie estimationofsequencingerrorratesinshortreads
AT dingjie estimationofsequencingerrorratesinshortreads
AT sultanarazvan estimationofsequencingerrorratesinshortreads
AT parmigianigiovanni estimationofsequencingerrorratesinshortreads