Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly

Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we...

Bibliographic Details
Main authors: Sameith, Katrin, Roscito, Juliana G, Hiller, Michael
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5221426/
https://www.ncbi.nlm.nih.gov/pubmed/26868358
http://dx.doi.org/10.1093/bib/bbw003
author Sameith, Katrin
Roscito, Juliana G
Hiller, Michael
collection PubMed
description Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
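The iterative idea in the description above (several rounds of k-mer-based correction with an increasing k-mer size) can be illustrated with a small toy sketch. This is not the SGA or SGA-ICE implementation: the function names, the single-substitution search, and the `min_count` threshold are all assumptions made for illustration, and the final overlap-based round is omitted.

```python
from collections import Counter

BASES = "ACGT"


def count_kmers(reads, k):
    """Count every k-mer across all reads (toy k-mer spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read, counts, k, min_count):
    """Start indices of k-mers seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]


def correct_read(read, counts, k, min_count):
    """Try all single-base substitutions (hypothetical strategy).

    A change is accepted only if exactly one candidate read has no weak
    k-mers left; ambiguous or uncorrectable reads are returned unchanged.
    """
    if not weak_positions(read, counts, k, min_count):
        return read
    candidates = set()
    for pos in range(len(read)):
        for base in BASES:
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if not weak_positions(candidate, counts, k, min_count):
                candidates.add(candidate)
    return candidates.pop() if len(candidates) == 1 else read


def iterative_correction(reads, k_values, min_count=2):
    """Run one correction round per k, recounting k-mers each round."""
    for k in k_values:
        counts = count_kmers(reads, k)
        reads = [correct_read(r, counts, k, min_count) for r in reads]
    return reads
```

Calling `iterative_correction(reads, [4, 6])` on a list of read strings runs a round with k = 4 and then a round with k = 6 on the already corrected reads. Recounting k-mers before each round is what lets a later, larger k work from the output of the earlier, smaller one.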
format Online
Article
Text
id pubmed-5221426
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5221426 2017-01-12 Brief Bioinform Paper Oxford University Press 2017-01 2016-02-10 Text en © The Author 2016. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly
topic Paper