Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes...

Bibliographic Details
Main authors: Koren, Sergey; Walenz, Brian P.; Berlin, Konstantin; Miller, Jason R.; Bergman, Nicholas H.; Phillippy, Adam M.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2017
Subjects: Method
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411767/
https://www.ncbi.nlm.nih.gov/pubmed/28298431
http://dx.doi.org/10.1101/gr.215087.116
author Koren, Sergey
Walenz, Brian P.
Berlin, Konstantin
Miller, Jason R.
Bergman, Nicholas H.
Phillippy, Adam M.
author_sort Koren, Sergey
collection PubMed
description Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
format Online
Article
Text
id pubmed-5411767
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-5411767 2017-05-16 Genome Res Method
Cold Spring Harbor Laboratory Press 2017-05 /pmc/articles/PMC5411767/ /pubmed/28298431 http://dx.doi.org/10.1101/gr.215087.116 Text en © 2017 Koren et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
title Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
topic Method