
NGmerge: merging paired-end reads via novel empirically-derived models of sequencing errors

BACKGROUND: Advances in Illumina DNA sequencing technology have produced longer paired-end reads that increasingly have sequence overlaps. These reads can be merged into a single read that spans the full length of the original DNA fragment, allowing for error correction and accurate determination of...


Bibliographic Details
Main author: Gaspar, John M.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302405/
https://www.ncbi.nlm.nih.gov/pubmed/30572828
http://dx.doi.org/10.1186/s12859-018-2579-2
author Gaspar, John M.
collection PubMed
description BACKGROUND: Advances in Illumina DNA sequencing technology have produced longer paired-end reads that increasingly have sequence overlaps. These reads can be merged into a single read that spans the full length of the original DNA fragment, allowing for error correction and accurate determination of read coverage. Extant merging programs utilize simplistic or unverified models for the selection of bases and quality scores for the overlapping region of merged reads. RESULTS: We first examined the baseline quality score - error rate relationship using sequence reads derived from PhiX. In contrast to numerous published reports, we found that the quality scores produced by Illumina were not substantially inflated above the theoretical values, once the reference genome was corrected for unreported sequence variants. The PhiX reads were then used to create empirical models of sequencing errors in overlapping regions of paired-end reads, and these models were incorporated into a novel merging program, NGmerge. We demonstrate that NGmerge corrects errors and ambiguous bases better than other merging programs, and that it assigns quality scores for merged bases that accurately reflect the error rates. Our results also show that, contrary to published analyses, the sequencing errors of paired-end reads are not independent. CONCLUSIONS: We provide a free and open-source program, NGmerge, that performs better than existing read merging programs. NGmerge is available on GitHub (https://github.com/harvardinformatics/NGmerge) under the MIT License; it is written in C and supported on Linux. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2579-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6302405
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6302405 2018-12-31 NGmerge: merging paired-end reads via novel empirically-derived models of sequencing errors Gaspar, John M. BMC Bioinformatics Methodology Article
BioMed Central 2018-12-20 /pmc/articles/PMC6302405/ /pubmed/30572828 http://dx.doi.org/10.1186/s12859-018-2579-2 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title NGmerge: merging paired-end reads via novel empirically-derived models of sequencing errors
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6302405/
https://www.ncbi.nlm.nih.gov/pubmed/30572828
http://dx.doi.org/10.1186/s12859-018-2579-2