Identification and correction of systematic error in high-throughput sequence data

BACKGROUND: A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "n...

Bibliographic Details
Main authors: Meacham, Frazer, Boffelli, Dario, Dhahbi, Joseph, Martin, David IK, Singer, Meromit, Pachter, Lior
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3295828/
https://www.ncbi.nlm.nih.gov/pubmed/22099972
http://dx.doi.org/10.1186/1471-2105-12-451
_version_ 1782225651417219072
author Meacham, Frazer
Boffelli, Dario
Dhahbi, Joseph
Martin, David IK
Singer, Meromit
Pachter, Lior
author_facet Meacham, Frazer
Boffelli, Dario
Dhahbi, Joseph
Martin, David IK
Singer, Meromit
Pachter, Lior
author_sort Meacham, Frazer
collection PubMed
description BACKGROUND: A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "next-gen" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error prone than previous technologies. Both position specific (depending on the location in the read) and sequence specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technology sequencing platforms. We describe a new type of systematic error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations. RESULTS: We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that they are highly replicable across experiments. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic error. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (such as in the case of RNA-Seq), and can be used with single-end datasets. CONCLUSIONS: Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low coverage experiments, or in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments.
format Online
Article
Text
id pubmed-3295828
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3295828 2012-03-08 Identification and correction of systematic error in high-throughput sequence data Meacham, Frazer Boffelli, Dario Dhahbi, Joseph Martin, David IK Singer, Meromit Pachter, Lior BMC Bioinformatics Research Article BACKGROUND: A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "next-gen" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error prone than previous technologies. Both position specific (depending on the location in the read) and sequence specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technology sequencing platforms. We describe a new type of systematic error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations. RESULTS: We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that they are highly replicable across experiments. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic error. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (such as in the case of RNA-Seq), and can be used with single-end datasets. CONCLUSIONS: Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low coverage experiments, or in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments. BioMed Central 2011-11-21 /pmc/articles/PMC3295828/ /pubmed/22099972 http://dx.doi.org/10.1186/1471-2105-12-451 Text en Copyright ©2011 Meacham et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Meacham, Frazer
Boffelli, Dario
Dhahbi, Joseph
Martin, David IK
Singer, Meromit
Pachter, Lior
Identification and correction of systematic error in high-throughput sequence data
title Identification and correction of systematic error in high-throughput sequence data
title_full Identification and correction of systematic error in high-throughput sequence data
title_fullStr Identification and correction of systematic error in high-throughput sequence data
title_full_unstemmed Identification and correction of systematic error in high-throughput sequence data
title_short Identification and correction of systematic error in high-throughput sequence data
title_sort identification and correction of systematic error in high-throughput sequence data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3295828/
https://www.ncbi.nlm.nih.gov/pubmed/22099972
http://dx.doi.org/10.1186/1471-2105-12-451
work_keys_str_mv AT meachamfrazer identificationandcorrectionofsystematicerrorinhighthroughputsequencedata
AT boffellidario identificationandcorrectionofsystematicerrorinhighthroughputsequencedata
AT dhahbijoseph identificationandcorrectionofsystematicerrorinhighthroughputsequencedata
AT martindavidik identificationandcorrectionofsystematicerrorinhighthroughputsequencedata
AT singermeromit identificationandcorrectionofsystematicerrorinhighthroughputsequencedata
AT pachterlior identificationandcorrectionofsystematicerrorinhighthroughputsequencedata