Cargando…

Bitpacking techniques for indexing genomes: I. Hash tables

BACKGROUND: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could...

Descripción completa

Detalles Bibliográficos
Autor principal: Wu, Thomas D.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2016
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4835851/
https://www.ncbi.nlm.nih.gov/pubmed/27095998
http://dx.doi.org/10.1186/s13015-016-0069-5
Descripción
Sumario:BACKGROUND: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome. RESULTS: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. CONCLUSIONS: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0069-5) contains supplementary material, which is available to authorized users.