Cargando…

A space and time-efficient index for the compacted colored de Bruijn graph

MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix arra...

Descripción completa

Detalles Bibliográficos
Autores principales: Almodaresi, Fatemeh, Sarkar, Hirak, Srivastava, Avi, Patro, Rob
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2018
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022659/
https://www.ncbi.nlm.nih.gov/pubmed/29949982
http://dx.doi.org/10.1093/bioinformatics/bty292
_version_ 1783335725469532160
author Almodaresi, Fatemeh
Sarkar, Hirak
Srivastava, Avi
Patro, Rob
author_facet Almodaresi, Fatemeh
Sarkar, Hirak
Srivastava, Avi
Patro, Rob
author_sort Almodaresi, Fatemeh
collection PubMed
description MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. RESULTS: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. AVAILABILITY AND IMPLEMENTATION: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6022659
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-60226592018-07-10 A space and time-efficient index for the compacted colored de Bruijn graph Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob Bioinformatics Ismb 2018–Intelligent Systems for Molecular Biology Proceedings MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. RESULTS: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. AVAILABILITY AND IMPLEMENTATION: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2018-07-01 2018-06-27 /pmc/articles/PMC6022659/ /pubmed/29949982 http://dx.doi.org/10.1093/bioinformatics/bty292 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Ismb 2018–Intelligent Systems for Molecular Biology Proceedings
Almodaresi, Fatemeh
Sarkar, Hirak
Srivastava, Avi
Patro, Rob
A space and time-efficient index for the compacted colored de Bruijn graph
title A space and time-efficient index for the compacted colored de Bruijn graph
title_full A space and time-efficient index for the compacted colored de Bruijn graph
title_fullStr A space and time-efficient index for the compacted colored de Bruijn graph
title_full_unstemmed A space and time-efficient index for the compacted colored de Bruijn graph
title_short A space and time-efficient index for the compacted colored de Bruijn graph
title_sort space and time-efficient index for the compacted colored de bruijn graph
topic Ismb 2018–Intelligent Systems for Molecular Biology Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022659/
https://www.ncbi.nlm.nih.gov/pubmed/29949982
http://dx.doi.org/10.1093/bioinformatics/bty292
work_keys_str_mv AT almodaresifatemeh aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph
AT sarkarhirak aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph
AT srivastavaavi aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph
AT patrorob aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph
AT almodaresifatemeh spaceandtimeefficientindexforthecompactedcoloreddebruijngraph
AT sarkarhirak spaceandtimeefficientindexforthecompactedcoloreddebruijngraph
AT srivastavaavi spaceandtimeefficientindexforthecompactedcoloreddebruijngraph
AT patrorob spaceandtimeefficientindexforthecompactedcoloreddebruijngraph