
Empirical assessment of sequencing errors for high throughput pyrosequencing data

BACKGROUND: Sequencing-by-synthesis technologies significantly improve over the Sanger method in terms of speed and cost per base. However, they still usually fail to compete in terms of read length and quality. Current high-throughput implementations of the pyrosequencing technique yield reads whos...


Bibliographic Details
Main authors: da Fonseca, Paulo GS, Paiva, Jorge AP, Almeida, Luiz GP, Vasconcelos, Ana TR, Freitas, Ana T
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852801/
https://www.ncbi.nlm.nih.gov/pubmed/23339526
http://dx.doi.org/10.1186/1756-0500-6-25
_version_ 1782478728701411328
author da Fonseca, Paulo GS
Paiva, Jorge AP
Almeida, Luiz GP
Vasconcelos, Ana TR
Freitas, Ana T
collection PubMed
description BACKGROUND: Sequencing-by-synthesis technologies significantly improve over the Sanger method in terms of speed and cost per base. However, they still usually fail to compete in terms of read length and quality. Current high-throughput implementations of the pyrosequencing technique yield reads whose lengths approach those of the capillary electrophoresis method. A less obvious question is whether their quality is affected by platform-specific sequencing errors. RESULTS: We present an empirical study aimed at assessing the quality and characterising sequencing errors for high-throughput pyrosequencing data. We have developed a procedure for extracting sequencing error data from genome assemblies and studying their characteristics, in particular the length distribution of indel gaps and their relation to the sequence contexts where they occur. We used this procedure to analyse data from three prokaryotic genomes sequenced with the GS FLX technology. We also compared two models previously employed with success for peptide sequence alignment. CONCLUSIONS: We observed an overall very low error rate in the analysed data, with indel errors being much more abundant than substitutions. We also observed a dependence between the length of the gaps and that of the homopolymer context where they occur. As with protein alignments, a power-law model seems to approximate the indel errors more accurately, although the results are not so conclusive as to justify a departure from the commonly used affine gap penalty scheme. In either case, however, our procedure can be used to estimate more realistic error model parameters.
format Online
Article
Text
id pubmed-3852801
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3852801 2013-12-13 Empirical assessment of sequencing errors for high throughput pyrosequencing data da Fonseca, Paulo GS Paiva, Jorge AP Almeida, Luiz GP Vasconcelos, Ana TR Freitas, Ana T BMC Res Notes Research Article BACKGROUND: Sequencing-by-synthesis technologies significantly improve over the Sanger method in terms of speed and cost per base. However, they still usually fail to compete in terms of read length and quality. Current high-throughput implementations of the pyrosequencing technique yield reads whose lengths approach those of the capillary electrophoresis method. A less obvious question is whether their quality is affected by platform-specific sequencing errors. RESULTS: We present an empirical study aimed at assessing the quality and characterising sequencing errors for high-throughput pyrosequencing data. We have developed a procedure for extracting sequencing error data from genome assemblies and studying their characteristics, in particular the length distribution of indel gaps and their relation to the sequence contexts where they occur. We used this procedure to analyse data from three prokaryotic genomes sequenced with the GS FLX technology. We also compared two models previously employed with success for peptide sequence alignment. CONCLUSIONS: We observed an overall very low error rate in the analysed data, with indel errors being much more abundant than substitutions. We also observed a dependence between the length of the gaps and that of the homopolymer context where they occur. As with protein alignments, a power-law model seems to approximate the indel errors more accurately, although the results are not so conclusive as to justify a departure from the commonly used affine gap penalty scheme. In either case, however, our procedure can be used to estimate more realistic error model parameters.
BioMed Central 2013-01-22 /pmc/articles/PMC3852801/ /pubmed/23339526 http://dx.doi.org/10.1186/1756-0500-6-25 Text en Copyright © 2013 da Fonseca et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Empirical assessment of sequencing errors for high throughput pyrosequencing data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852801/
https://www.ncbi.nlm.nih.gov/pubmed/23339526
http://dx.doi.org/10.1186/1756-0500-6-25