Improved assembly of noisy long reads by k-mer validation

Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However...

Bibliographic Details
Main Authors: Carvalho, Antonio Bernardo, Dupim, Eduardo G., Goldstein, Gabriel
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5131822/
https://www.ncbi.nlm.nih.gov/pubmed/27831497
http://dx.doi.org/10.1101/gr.209247.116
_version_ 1782470951591477248
author Carvalho, Antonio Bernardo
Dupim, Eduardo G.
Goldstein, Gabriel
author_facet Carvalho, Antonio Bernardo
Dupim, Eduardo G.
Goldstein, Gabriel
author_sort Carvalho, Antonio Bernardo
collection PubMed
description Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) make assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at minimal cost by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.
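The validation idea in the description can be illustrated with a minimal Python sketch. This is not the authors' implementation (the record says they tested the procedure with MHAP/Celera Assembler); it only shows the filtering logic on toy strings. The function names (`build_valid_kmers`, `validated_seeds`), the use of canonical k-mers, and the `max_count` repeat threshold are all assumptions made for illustration, and real data would need a dedicated k-mer counter rather than an in-memory `Counter`.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Optional, Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = frozenset("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def iter_kmers(seq: str, k: int) -> Iterator[Tuple[int, str]]:
    """Yield (position, canonical k-mer) pairs, skipping k-mers with non-ACGT bases."""
    seq = seq.upper()
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        if _BASES.issuperset(kmer):
            yield pos, canonical(kmer)


def build_valid_kmers(
    short_reads: Iterable[str], k: int, max_count: Optional[int] = None
) -> Set[str]:
    """Collect k-mers seen in the accurate short reads.

    max_count is a hypothetical repeat filter: k-mers observed more often
    than this in the short reads are treated as repetitive and excluded.
    """
    counts: Counter = Counter()
    for read in short_reads:
        counts.update(kmer for _, kmer in iter_kmers(read, k))
    return {
        kmer for kmer, n in counts.items() if max_count is None or n <= max_count
    }


def validated_seeds(long_read: str, k: int, valid: Set[str]) -> List[Tuple[int, str]]:
    """Keep only the long-read k-mers that are present in the validated set."""
    return [(pos, kmer) for pos, kmer in iter_kmers(long_read, k) if kmer in valid]
```

In this sketch, a long-read k-mer absent from the short-read set is treated as a likely sequencing error and never becomes a seed; passing `max_count` additionally drops high-copy k-mers, which mirrors the extension to repetitive k-mers mentioned in the description.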
format Online
Article
Text
id pubmed-5131822
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-5131822 2017-06-01 Improved assembly of noisy long reads by k-mer validation Carvalho, Antonio Bernardo Dupim, Eduardo G. Goldstein, Gabriel Genome Res Method
Cold Spring Harbor Laboratory Press 2016-12 /pmc/articles/PMC5131822/ /pubmed/27831497 http://dx.doi.org/10.1101/gr.209247.116 Text en © 2016 Carvalho et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
spellingShingle Method
Carvalho, Antonio Bernardo
Dupim, Eduardo G.
Goldstein, Gabriel
Improved assembly of noisy long reads by k-mer validation
title Improved assembly of noisy long reads by k-mer validation
title_sort improved assembly of noisy long reads by k-mer validation
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5131822/
https://www.ncbi.nlm.nih.gov/pubmed/27831497
http://dx.doi.org/10.1101/gr.209247.116
work_keys_str_mv AT carvalhoantoniobernardo improvedassemblyofnoisylongreadsbykmervalidation
AT dupimeduardog improvedassemblyofnoisylongreadsbykmervalidation
AT goldsteingabriel improvedassemblyofnoisylongreadsbykmervalidation