Cargando…

Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays

BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques f...

Descripción completa

Detalles Bibliográficos
Autor principal: Wu, Thomas D.
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2016
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842304/
https://www.ncbi.nlm.nih.gov/pubmed/27110277
http://dx.doi.org/10.1186/s13015-016-0068-6
_version_ 1782428500137869312
author Wu, Thomas D.
author_facet Wu, Thomas D.
author_sort Wu, Thomas D.
collection PubMed
description BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements. RESULTS: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search. CONCLUSIONS: The combination of exception guide arrays, a discriminating character array, and integrated data storage provide a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20 % faster and 40 % more space-efficient than an uncompressed ESA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4842304
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-48423042016-04-25 Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays Wu, Thomas D. Algorithms Mol Biol Research BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements. RESULTS: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search. CONCLUSIONS: The combination of exception guide arrays, a discriminating character array, and integrated data storage provide a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20 % faster and 40 % more space-efficient than an uncompressed ESA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users. BioMed Central 2016-04-23 /pmc/articles/PMC4842304/ /pubmed/27110277 http://dx.doi.org/10.1186/s13015-016-0068-6 Text en © Wu. 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Wu, Thomas D.
Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_full Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_fullStr Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_full_unstemmed Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_short Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_sort bitpacking techniques for indexing genomes: ii. enhanced suffix arrays
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842304/
https://www.ncbi.nlm.nih.gov/pubmed/27110277
http://dx.doi.org/10.1186/s13015-016-0068-6
work_keys_str_mv AT wuthomasd bitpackingtechniquesforindexinggenomesiienhancedsuffixarrays