Cargando…

Compacting de Bruijn graphs from sequencing data quickly and in low memory

Motivation: As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction...

Descripción completa

Detalles Bibliográficos
Autores principales: Chikhi, Rayan, Limasset, Antoine, Medvedev, Paul
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2016
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908363/
https://www.ncbi.nlm.nih.gov/pubmed/27307618
http://dx.doi.org/10.1093/bioinformatics/btw279
_version_ 1782437668429234176
author Chikhi, Rayan
Limasset, Antoine
Medvedev, Paul
author_facet Chikhi, Rayan
Limasset, Antoine
Medvedev, Paul
author_sort Chikhi, Rayan
collection PubMed
description Motivation: As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem. Results: We present an algorithm and a tool bcalm 2 for the compaction of de Bruijn graphs. bcalm 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, bcalm 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied bcalm 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, bcalm 2 is at least an order of magnitude more efficient than other available methods. Availability and Implementation: Source code of bcalm 2 is freely available at: https://github.com/GATB/bcalm Contact: rayan.chikhi@univ-lille1.fr
format Online
Article
Text
id pubmed-4908363
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-49083632016-06-17 Compacting de Bruijn graphs from sequencing data quickly and in low memory Chikhi, Rayan Limasset, Antoine Medvedev, Paul Bioinformatics Ismb 2016 Proceedings July 8 to July 12, 2016, Orlando, Florida Motivation: As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem. Results: We present an algorithm and a tool bcalm 2 for the compaction of de Bruijn graphs. bcalm 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, bcalm 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied bcalm 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, bcalm 2 is at least an order of magnitude more efficient than other available methods. Availability and Implementation: Source code of bcalm 2 is freely available at: https://github.com/GATB/bcalm Contact: rayan.chikhi@univ-lille1.fr Oxford University Press 2016-06-15 2016-06-11 /pmc/articles/PMC4908363/ /pubmed/27307618 http://dx.doi.org/10.1093/bioinformatics/btw279 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Ismb 2016 Proceedings July 8 to July 12, 2016, Orlando, Florida
Chikhi, Rayan
Limasset, Antoine
Medvedev, Paul
Compacting de Bruijn graphs from sequencing data quickly and in low memory
title Compacting de Bruijn graphs from sequencing data quickly and in low memory
title_full Compacting de Bruijn graphs from sequencing data quickly and in low memory
title_fullStr Compacting de Bruijn graphs from sequencing data quickly and in low memory
title_full_unstemmed Compacting de Bruijn graphs from sequencing data quickly and in low memory
title_short Compacting de Bruijn graphs from sequencing data quickly and in low memory
title_sort compacting de bruijn graphs from sequencing data quickly and in low memory
topic Ismb 2016 Proceedings July 8 to July 12, 2016, Orlando, Florida
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4908363/
https://www.ncbi.nlm.nih.gov/pubmed/27307618
http://dx.doi.org/10.1093/bioinformatics/btw279
work_keys_str_mv AT chikhirayan compactingdebruijngraphsfromsequencingdataquicklyandinlowmemory
AT limassetantoine compactingdebruijngraphsfromsequencingdataquicklyandinlowmemory
AT medvedevpaul compactingdebruijngraphsfromsequencingdataquicklyandinlowmemory