Cargando…

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

BACKGROUND: High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection...

Descripción completa

Detalles Bibliográficos
Autores principales:	Holley, Guillaume, Wittler, Roland, Stoye, Jens
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2016
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4832552/ https://www.ncbi.nlm.nih.gov/pubmed/27087830 http://dx.doi.org/10.1186/s13015-016-0066-8

_version_	1782427275199774720
author	Holley, Guillaume Wittler, Roland Stoye, Jens
author_facet	Holley, Guillaume Wittler, Roland Stoye, Jens
author_sort	Holley, Guillaume
collection	PubMed
description	BACKGROUND: High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. RESULTS: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory. CONCLUSION: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure. AVAILABILITY: https://www.github.com/GuillaumeHolley/BloomFilterTrie.
format	Online Article Text
id	pubmed-4832552
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-48325522016-04-16 Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage Holley, Guillaume Wittler, Roland Stoye, Jens Algorithms Mol Biol Research BACKGROUND: High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences “colored” by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. RESULTS: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52–66 times faster while using about 5.5–14.3 times less memory. CONCLUSION: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure. AVAILABILITY: https://www.github.com/GuillaumeHolley/BloomFilterTrie. BioMed Central 2016-04-14 /pmc/articles/PMC4832552/ /pubmed/27087830 http://dx.doi.org/10.1186/s13015-016-0066-8 Text en © Holley et al. 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Holley, Guillaume Wittler, Roland Stoye, Jens Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
title	Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
title_full	Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
title_fullStr	Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
title_full_unstemmed	Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
title_short	Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage
title_sort	bloom filter trie: an alignment-free and reference-free data structure for pan-genome storage
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4832552/ https://www.ncbi.nlm.nih.gov/pubmed/27087830 http://dx.doi.org/10.1186/s13015-016-0066-8
work_keys_str_mv	AT holleyguillaume bloomfiltertrieanalignmentfreeandreferencefreedatastructureforpangenomestorage AT wittlerroland bloomfiltertrieanalignmentfreeandreferencefreedatastructureforpangenomestorage AT stoyejens bloomfiltertrieanalignmentfreeandreferencefreedatastructureforpangenomestorage

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage

Ejemplares similares