Cargando…
Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques f...
Autor principal: | |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2016
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842304/ https://www.ncbi.nlm.nih.gov/pubmed/27110277 http://dx.doi.org/10.1186/s13015-016-0068-6 |
_version_ | 1782428500137869312 |
---|---|
author | Wu, Thomas D. |
author_facet | Wu, Thomas D. |
author_sort | Wu, Thomas D. |
collection | PubMed |
description | BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements. RESULTS: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search. CONCLUSIONS: The combination of exception guide arrays, a discriminating character array, and integrated data storage provide a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20 % faster and 40 % more space-efficient than an uncompressed ESA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4842304 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-48423042016-04-25 Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays Wu, Thomas D. Algorithms Mol Biol Research BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements. RESULTS: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search. CONCLUSIONS: The combination of exception guide arrays, a discriminating character array, and integrated data storage provide a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20 % faster and 40 % more space-efficient than an uncompressed ESA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users. BioMed Central 2016-04-23 /pmc/articles/PMC4842304/ /pubmed/27110277 http://dx.doi.org/10.1186/s13015-016-0068-6 Text en © Wu. 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Wu, Thomas D. Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays |
title | Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays |
title_full | Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays |
title_fullStr | Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays |
title_full_unstemmed | Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays |
title_short | Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays |
title_sort | bitpacking techniques for indexing genomes: ii. enhanced suffix arrays |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842304/ https://www.ncbi.nlm.nih.gov/pubmed/27110277 http://dx.doi.org/10.1186/s13015-016-0068-6 |
work_keys_str_mv | AT wuthomasd bitpackingtechniquesforindexinggenomesiienhancedsuffixarrays |