Cargando…

A space and time-efficient index for the compacted colored de Bruijn graph

MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix arra...

Descripción completa

Detalles Bibliográficos
Autores principales:	Almodaresi, Fatemeh, Sarkar, Hirak, Srivastava, Avi, Patro, Rob
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Oxford University Press 2018
Materias:	Ismb 2018–Intelligent Systems for Molecular Biology Proceedings
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022659/ https://www.ncbi.nlm.nih.gov/pubmed/29949982 http://dx.doi.org/10.1093/bioinformatics/bty292

_version_	1783335725469532160
author	Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob
author_facet	Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob
author_sort	Almodaresi, Fatemeh
collection	PubMed
description	MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. RESULTS: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. AVAILABILITY AND IMPLEMENTATION: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6022659
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-60226592018-07-10 A space and time-efficient index for the compacted colored de Bruijn graph Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob Bioinformatics Ismb 2018–Intelligent Systems for Molecular Biology Proceedings MOTIVATION: Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. RESULTS: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. AVAILABILITY AND IMPLEMENTATION: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2018-07-01 2018-06-27 /pmc/articles/PMC6022659/ /pubmed/29949982 http://dx.doi.org/10.1093/bioinformatics/bty292 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Ismb 2018–Intelligent Systems for Molecular Biology Proceedings Almodaresi, Fatemeh Sarkar, Hirak Srivastava, Avi Patro, Rob A space and time-efficient index for the compacted colored de Bruijn graph
title	A space and time-efficient index for the compacted colored de Bruijn graph
title_full	A space and time-efficient index for the compacted colored de Bruijn graph
title_fullStr	A space and time-efficient index for the compacted colored de Bruijn graph
title_full_unstemmed	A space and time-efficient index for the compacted colored de Bruijn graph
title_short	A space and time-efficient index for the compacted colored de Bruijn graph
title_sort	space and time-efficient index for the compacted colored de bruijn graph
topic	Ismb 2018–Intelligent Systems for Molecular Biology Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022659/ https://www.ncbi.nlm.nih.gov/pubmed/29949982 http://dx.doi.org/10.1093/bioinformatics/bty292
work_keys_str_mv	AT almodaresifatemeh aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT sarkarhirak aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT srivastavaavi aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT patrorob aspaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT almodaresifatemeh spaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT sarkarhirak spaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT srivastavaavi spaceandtimeefficientindexforthecompactedcoloreddebruijngraph AT patrorob spaceandtimeefficientindexforthecompactedcoloreddebruijngraph

A space and time-efficient index for the compacted colored de Bruijn graph

Ejemplares similares