Cargando…
A space and time-efficient index for the compacted colored de Bruijn graph
MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix arra...
Autores principales: | , , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2018
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022659/ https://www.ncbi.nlm.nih.gov/pubmed/29949982 http://dx.doi.org/10.1093/bioinformatics/bty292 |
_version_ | 1783335725469532160 |
---|---|
author | Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob |
author_facet | Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob |
author_sort | Almodaresi, Fatemeh |
collection | PubMed |
description | MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. RESULTS: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. AVAILABILITY AND IMPLEMENTATION: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-6022659 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-60226592018-07-10 A space and time-efficient index for the compacted colored de Bruijn graph Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob Bioinformatics Ismb 2018–Intelligent Systems for Molecular Biology Proceedings MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. RESULTS: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. AVAILABILITY AND IMPLEMENTATION: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2018-07-01 2018-06-27 /pmc/articles/PMC6022659/ /pubmed/29949982 http://dx.doi.org/10.1093/bioinformatics/bty292 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Ismb 2018–Intelligent Systems for Molecular Biology Proceedings Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob A space and time-efficient index for the compacted colored de Bruijn graph |
title | A space and time-efficient index for the compacted colored de Bruijn graph |
title_full | A space and time-efficient index for the compacted colored de Bruijn graph |
title_fullStr | A space and time-efficient index for the compacted colored de Bruijn graph |
title_full_unstemmed | A space and time-efficient index for the compacted colored de Bruijn graph |
title_short | A space and time-efficient index for the compacted colored de Bruijn graph |
title_sort | space and time-efficient index for the compacted colored de bruijn graph |
topic | Ismb 2018–Intelligent Systems for Molecular Biology Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022659/ https://www.ncbi.nlm.nih.gov/pubmed/29949982 http://dx.doi.org/10.1093/bioinformatics/bty292 |
work_keys_str_mv | AT almodaresifatemeh aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT sarkarhirak aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT srivastavaavi aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT patrorob aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT almodaresifatemeh spaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT sarkarhirak spaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT srivastavaavi spaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT patrorob spaceandtimeefficientindexforthecompactedcoloreddebruijngraph |