Cargando…
Bitpacking techniques for indexing genomes: I. Hash tables
BACKGROUND: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could...
Autor principal: | |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2016
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4835851/ https://www.ncbi.nlm.nih.gov/pubmed/27095998 http://dx.doi.org/10.1186/s13015-016-0069-5 |
_version_ | 1782427682777071616 |
---|---|
author | Wu, Thomas D. |
author_facet | Wu, Thomas D. |
author_sort | Wu, Thomas D. |
collection | PubMed |
description | BACKGROUND: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome. RESULTS: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. CONCLUSIONS: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0069-5) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4835851 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-48358512016-04-20 Bitpacking techniques for indexing genomes: I. Hash tables Wu, Thomas D. Algorithms Mol Biol Research BACKGROUND: Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome. RESULTS: We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. CONCLUSIONS: Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0069-5) contains supplementary material, which is available to authorized users. BioMed Central 2016-04-18 /pmc/articles/PMC4835851/ /pubmed/27095998 http://dx.doi.org/10.1186/s13015-016-0069-5 Text en © Wu. 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Wu, Thomas D. Bitpacking techniques for indexing genomes: I. Hash tables |
title | Bitpacking techniques for indexing genomes: I. Hash tables |
title_full | Bitpacking techniques for indexing genomes: I. Hash tables |
title_fullStr | Bitpacking techniques for indexing genomes: I. Hash tables |
title_full_unstemmed | Bitpacking techniques for indexing genomes: I. Hash tables |
title_short | Bitpacking techniques for indexing genomes: I. Hash tables |
title_sort | bitpacking techniques for indexing genomes: i. hash tables |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4835851/ https://www.ncbi.nlm.nih.gov/pubmed/27095998 http://dx.doi.org/10.1186/s13015-016-0069-5 |
work_keys_str_mv | AT wuthomasd bitpackingtechniquesforindexinggenomesihashtables |