Cargando…

Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays

BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques f...

Descripción completa

Detalles Bibliográficos
Autor principal:	Wu, Thomas D.
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2016
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842304/ https://www.ncbi.nlm.nih.gov/pubmed/27110277 http://dx.doi.org/10.1186/s13015-016-0068-6

_version_	1782428500137869312
author	Wu, Thomas D.
author_facet	Wu, Thomas D.
author_sort	Wu, Thomas D.
collection	PubMed
description	BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements. RESULTS: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search. CONCLUSIONS: The combination of exception guide arrays, a discriminating character array, and integrated data storage provide a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20 % faster and 40 % more space-efficient than an uncompressed ESA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4842304
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-48423042016-04-25 Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays Wu, Thomas D. Algorithms Mol Biol Research BACKGROUND: Suffix arrays and their variants are used widely for representing genomes in search applications. Enhanced suffix arrays (ESAs) provide fast search speed, but require large auxiliary data structures for storing longest common prefix and child interval information. We explore techniques for compressing ESAs to accelerate genomic search and reduce memory requirements. RESULTS: We evaluate various bitpacking techniques that store integers in fewer than 32 bits each, as well as bytecoding methods that reserve a single byte per integer whenever possible. Our results on the fly, chicken, and human genomes show that bytecoding with an exception guide array is the fastest method for retrieving auxiliary information. Genomic searching can be further accelerated using a data structure called a discriminating character array, which reduces memory accesses to the suffix array and the genome string. Finally, integrating storage of the auxiliary and discriminating character arrays further speeds up genomic search. CONCLUSIONS: The combination of exception guide arrays, a discriminating character array, and integrated data storage provide a 2- to 3-fold increase in speed for genomic searching compared with using bytecoding alone, and is 20 % faster and 40 % more space-efficient than an uncompressed ESA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13015-016-0068-6) contains supplementary material, which is available to authorized users. BioMed Central 2016-04-23 /pmc/articles/PMC4842304/ /pubmed/27110277 http://dx.doi.org/10.1186/s13015-016-0068-6 Text en © Wu. 2016 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Wu, Thomas D. Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title	Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_full	Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_fullStr	Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_full_unstemmed	Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_short	Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays
title_sort	bitpacking techniques for indexing genomes: ii. enhanced suffix arrays
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842304/ https://www.ncbi.nlm.nih.gov/pubmed/27110277 http://dx.doi.org/10.1186/s13015-016-0068-6
work_keys_str_mv	AT wuthomasd bitpackingtechniquesforindexinggenomesiienhancedsuffixarrays

Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays

Ejemplares similares