ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers
BACKGROUND: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodo...
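Main Authors: | Coombe, Lauren; Zhang, Jessica; Vandervalk, Benjamin P.; Chu, Justin; Jackman, Shaun D.; Birol, Inanc; Warren, René L. |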
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | Software |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6011487/ https://www.ncbi.nlm.nih.gov/pubmed/29925315 http://dx.doi.org/10.1186/s12859-018-2243-x |
_version_ | 1783333805296189440 |
---|---|
author | Coombe, Lauren; Zhang, Jessica; Vandervalk, Benjamin P.; Chu, Justin; Jackman, Shaun D.; Birol, Inanc; Warren, René L. |
author_facet | Coombe, Lauren; Zhang, Jessica; Vandervalk, Benjamin P.; Chu, Justin; Jackman, Shaun D.; Birol, Inanc; Warren, René L. |
author_sort | Coombe, Lauren |
collection | PubMed |
description | BACKGROUND: The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. RESULTS: Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13). CONCLUSIONS: ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run time performances on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances of a region within contigs and calculating associated Jaccard indices. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2243-x) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6011487 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6011487 2018-07-05 ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers Coombe, Lauren Zhang, Jessica Vandervalk, Benjamin P. Chu, Justin Jackman, Shaun D. Birol, Inanc Warren, René L. BMC Bioinformatics Software BioMed Central 2018-06-20 /pmc/articles/PMC6011487/ /pubmed/29925315 http://dx.doi.org/10.1186/s12859-018-2243-x Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6011487/ https://www.ncbi.nlm.nih.gov/pubmed/29925315 http://dx.doi.org/10.1186/s12859-018-2243-x |
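
The abstract reports its results as NG50 values and describes a gap-size estimator built on Jaccard indices of barcode information. As a reading aid, the sketch below shows how these two standard quantities are commonly computed. It is not taken from the ARKS source code: the function names `ng50` and `jaccard_index`, the toy scaffold lengths, and the barcode labels are all invented for illustration, and the record does not say how ARKS turns Jaccard indices into distance estimates.

```python
from typing import Iterable, Set


def ng50(scaffold_lengths: Iterable[int], genome_size: int) -> int:
    """Return the NG50 of an assembly.

    NG50 is the length L such that scaffolds of length >= L together
    cover at least half of the (estimated) genome size.
    """
    running_total = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running_total += length
        if 2 * running_total >= genome_size:
            return length
    return 0  # the assembly covers less than half of the genome


def jaccard_index(barcodes_a: Set[str], barcodes_b: Set[str]) -> float:
    """Return |A intersect B| / |A union B| for two barcode sets."""
    union = barcodes_a | barcodes_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(barcodes_a & barcodes_b) / len(union)


# Toy data, invented for illustration only.
lengths = [40, 30, 20, 10]
print(ng50(lengths, genome_size=100))  # 30, since 40 + 30 = 70 >= 50

end_a = {"BX1", "BX2", "BX3"}  # hypothetical barcodes seen at one contig end
end_b = {"BX2", "BX3", "BX4"}  # hypothetical barcodes seen at another
print(jaccard_index(end_a, end_b))  # 0.5, two shared out of four distinct
```

Intuitively, two contig ends that share many barcodes (a high Jaccard index) are likely to lie close together on the same molecule, which is the kind of signal a barcode-based distance estimator can draw on.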