Cargando…

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections

MOTIVATION: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity o...

Descripción completa

Detalles Bibliográficos
Autores principales:	Khan, Jamshed, Patro, Rob
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2021
Materias:	Genome Sequence Analysis
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8275350/ https://www.ncbi.nlm.nih.gov/pubmed/34252958 http://dx.doi.org/10.1093/bioinformatics/btab309

_version_	1783721695920521216
author	Khan, Jamshed Patro, Rob
author_facet	Khan, Jamshed Patro, Rob
author_sort	Khan, Jamshed
collection	PubMed
description	MOTIVATION: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. RESULTS: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using [Formula: see text] 29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using [Formula: see text] 84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using [Formula: see text] 126 GB of memory, and over 16 h using [Formula: see text] 289 GB of memory, respectively. AVAILABILITY AND IMPLEMENTATION: Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-8275350
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-82753502021-07-13 Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections Khan, Jamshed Patro, Rob Bioinformatics Genome Sequence Analysis MOTIVATION: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. RESULTS: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using [Formula: see text] 29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using [Formula: see text] 84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using [Formula: see text] 126 GB of memory, and over 16 h using [Formula: see text] 289 GB of memory, respectively. AVAILABILITY AND IMPLEMENTATION: Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2021-07-12 /pmc/articles/PMC8275350/ /pubmed/34252958 http://dx.doi.org/10.1093/bioinformatics/btab309 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) ), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Genome Sequence Analysis Khan, Jamshed Patro, Rob Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections
title	Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections
title_full	Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections
title_fullStr	Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections
title_full_unstemmed	Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections
title_short	Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections
title_sort	cuttlefish: fast, parallel and low-memory compaction of de bruijn graphs from large-scale genome collections
topic	Genome Sequence Analysis
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8275350/ https://www.ncbi.nlm.nih.gov/pubmed/34252958 http://dx.doi.org/10.1093/bioinformatics/btab309
work_keys_str_mv	AT khanjamshed cuttlefishfastparallelandlowmemorycompactionofdebruijngraphsfromlargescalegenomecollections AT patrorob cuttlefishfastparallelandlowmemorycompactionofdebruijngraphsfromlargescalegenomecollections

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections

Ejemplares similares