Integrating long-range connectivity information into de Bruijn graphs

Bibliographic Details
Main authors: Turner, Isaac; Garimella, Kiran V; Iqbal, Zamin; McVean, Gil
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6061703/
https://www.ncbi.nlm.nih.gov/pubmed/29554215
http://dx.doi.org/10.1093/bioinformatics/bty157
author Turner, Isaac
Garimella, Kiran V
Iqbal, Zamin
McVean, Gil
collection PubMed
description MOTIVATION: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis, including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain the long-range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. RESULTS: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long-range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn Graph. With assembly simulations, we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally, we apply the LdBG to Klebsiella pneumoniae short-read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. AVAILABILITY AND IMPLEMENTATION: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6061703
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6061703 2018-08-07 Integrating long-range connectivity information into de Bruijn graphs Turner, Isaac; Garimella, Kiran V; Iqbal, Zamin; McVean, Gil Bioinformatics Original Papers
Oxford University Press 2018-08-01 2018-03-15 /pmc/articles/PMC6061703/ /pubmed/29554215 http://dx.doi.org/10.1093/bioinformatics/bty157 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Integrating long-range connectivity information into de Bruijn graphs
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6061703/
https://www.ncbi.nlm.nih.gov/pubmed/29554215
http://dx.doi.org/10.1093/bioinformatics/bty157