Improved assembly of noisy long reads by k-mer validation
Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However...
Main authors: | Carvalho, Antonio Bernardo; Dupim, Eduardo G.; Goldstein, Gabriel |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press, 2016 |
Subjects: | Method |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5131822/ https://www.ncbi.nlm.nih.gov/pubmed/27831497 http://dx.doi.org/10.1101/gr.209247.116 |
_version_ | 1782470951591477248 |
---|---|
author | Carvalho, Antonio Bernardo; Dupim, Eduardo G.; Goldstein, Gabriel |
author_facet | Carvalho, Antonio Bernardo; Dupim, Eduardo G.; Goldstein, Gabriel |
author_sort | Carvalho, Antonio Bernardo |
collection | PubMed |
description | Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) make assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively. |
format | Online Article Text |
id | pubmed-5131822 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5131822 2017-06-01 Improved assembly of noisy long reads by k-mer validation Carvalho, Antonio Bernardo; Dupim, Eduardo G.; Goldstein, Gabriel Genome Res Method Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) make assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.
Cold Spring Harbor Laboratory Press 2016-12 /pmc/articles/PMC5131822/ /pubmed/27831497 http://dx.doi.org/10.1101/gr.209247.116 Text en © 2016 Carvalho et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. |
spellingShingle | Method Carvalho, Antonio Bernardo; Dupim, Eduardo G.; Goldstein, Gabriel Improved assembly of noisy long reads by k-mer validation |
title | Improved assembly of noisy long reads by k-mer validation |
title_full | Improved assembly of noisy long reads by k-mer validation |
title_fullStr | Improved assembly of noisy long reads by k-mer validation |
title_full_unstemmed | Improved assembly of noisy long reads by k-mer validation |
title_short | Improved assembly of noisy long reads by k-mer validation |
title_sort | improved assembly of noisy long reads by k-mer validation |
topic | Method |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5131822/ https://www.ncbi.nlm.nih.gov/pubmed/27831497 http://dx.doi.org/10.1101/gr.209247.116 |
work_keys_str_mv | AT carvalhoantoniobernardo improvedassemblyofnoisylongreadsbykmervalidation AT dupimeduardog improvedassemblyofnoisylongreadsbykmervalidation AT goldsteingabriel improvedassemblyofnoisylongreadsbykmervalidation |
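
The k-mer validation idea summarized in the description field can be illustrated with a minimal Python sketch. It is not the authors' implementation (the record states they worked with MHAP/Celera Assembler); the function names, the use of canonical k-mers, and the `min_count`/`max_count` thresholds are all assumptions made for illustration. Long-read k-mers that never occur in the short reads are treated as errors and skipped, and an optional upper count limit drops repetitive k-mers.

```python
from collections import Counter

# Complement table for DNA; other characters (e.g. N) pass through unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (strand-neutral form)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k):
    """Yield the canonical form of every k-mer in seq, in order of position."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def build_valid_set(short_reads, k, min_count=1, max_count=None):
    """Collect k-mers seen in the accurate short reads.

    min_count and max_count are hypothetical thresholds: k-mers seen fewer
    than min_count times are discarded, and k-mers seen more than max_count
    times are treated as repetitive and discarded as well.
    """
    counts = Counter()
    for read in short_reads:
        counts.update(kmers(read, k))
    return {
        kmer
        for kmer, count in counts.items()
        if count >= min_count and (max_count is None or count <= max_count)
    }


def validated_seeds(long_read, valid, k):
    """Return (position, k-mer) pairs of the long read whose k-mer is validated."""
    return [
        (i, kmer)
        for i, kmer in enumerate(kmers(long_read, k))
        if kmer in valid
    ]


if __name__ == "__main__":
    short_reads = ["ACGTTG"]   # toy stand-in for Illumina reads
    noisy_long_read = "ACGATG"  # toy stand-in for a PacBio read with an error
    valid = build_valid_set(short_reads, k=3)
    print(validated_seeds(noisy_long_read, valid, k=3))
```

In a real pipeline the k-mer size would be much larger and the counting would be done by a dedicated k-mer counter; only the validated k-mers would then be offered to the overlapper as seeds.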