
ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers

BACKGROUND: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodo...


Bibliographic Details
Main authors: Coombe, Lauren; Zhang, Jessica; Vandervalk, Benjamin P.; Chu, Justin; Jackman, Shaun D.; Birol, Inanc; Warren, René L.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6011487/
https://www.ncbi.nlm.nih.gov/pubmed/29925315
http://dx.doi.org/10.1186/s12859-018-2243-x
author Coombe, Lauren
Zhang, Jessica
Vandervalk, Benjamin P.
Chu, Justin
Jackman, Shaun D.
Birol, Inanc
Warren, René L.
author_sort Coombe, Lauren
collection PubMed
description BACKGROUND: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it removes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time.
RESULTS: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13).
CONCLUSIONS: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run-time performance on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS uses barcoding information from linked reads to estimate gap sizes. It does so by modeling the relationship between known distances of a region within contigs and the associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2243-x) contains supplementary material, which is available to authorized users.
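The kmer mapping and Jaccard index ideas named in the abstract can be illustrated with a small sketch. This is not the ARKS implementation: the function names, the `head`/`tail` labels, the `min_fraction` threshold, the rule that discards kmers shared between contig ends, and the omission of reverse complements are all assumptions made for illustration. It shows only how read kmers could associate barcodes with contig ends without alignment, and how a Jaccard index between two barcode sets is computed; the gap-size model itself is not reproduced.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping kmers of length k in seq (forward strand only)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def index_contig_ends(contigs, k, end_len):
    """Map each kmer to the (contig, end) it comes from.

    Assumption: kmers seen in more than one contig end are discarded
    so that every retained kmer points to exactly one location.
    """
    index, ambiguous = {}, set()
    for name, seq in contigs.items():
        for end, sub in (("head", seq[:end_len]), ("tail", seq[-end_len:])):
            for km in kmers(sub, k):
                if km in ambiguous:
                    continue
                label = (name, end)
                if index.setdefault(km, label) != label:
                    del index[km]
                    ambiguous.add(km)
    return index


def barcodes_per_end(reads, index, k, min_fraction=0.5):
    """Assign each (barcode, sequence) read to its best-matching contig end.

    A read is assigned only if at least min_fraction of its kmers hit
    that end (hypothetical threshold); the barcode is then recorded there.
    """
    barcodes = defaultdict(set)
    for barcode, seq in reads:
        read_kmers = kmers(seq, k)
        hits = Counter(index[km] for km in read_kmers if km in index)
        if not hits:
            continue
        label, count = hits.most_common(1)[0]
        if count / len(read_kmers) >= min_fraction:
            barcodes[label].add(barcode)
    return barcodes


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two barcode sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

In this toy setting, two contig ends that share many barcodes (a high Jaccard index) would be candidates for joining; the abstract describes using such indices, calibrated on known intra-contig distances, to estimate gap sizes.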
format Online
Article
Text
id pubmed-6011487
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6011487 2018-07-05 ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers Coombe, Lauren Zhang, Jessica Vandervalk, Benjamin P. Chu, Justin Jackman, Shaun D. Birol, Inanc Warren, René L. BMC Bioinformatics Software BioMed Central 2018-06-20 /pmc/articles/PMC6011487/ /pubmed/29925315 http://dx.doi.org/10.1186/s12859-018-2243-x Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6011487/
https://www.ncbi.nlm.nih.gov/pubmed/29925315
http://dx.doi.org/10.1186/s12859-018-2243-x