Cargando…

Fast randomized approximate string matching with succinct hash data structures

BACKGROUND: The high throughput of modern NGS sequencers coupled with the huge sizes of genomes currently analysed, poses always higher algorithmic challenges to align short reads quickly and accurately against a reference sequence. A crucial, additional, requirement is that the data structures used...

Descripción completa

Detalles Bibliográficos
Autores principales:	Policriti, Alberto, Prezza, Nicola
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2015
Materias:	Research
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4464037/ https://www.ncbi.nlm.nih.gov/pubmed/26051265 http://dx.doi.org/10.1186/1471-2105-16-S9-S4

_version_	1782375879551221760
author	Policriti, Alberto Prezza, Nicola
author_facet	Policriti, Alberto Prezza, Nicola
author_sort	Policriti, Alberto
collection	PubMed
description	BACKGROUND: The high throughput of modern NGS sequencers coupled with the huge sizes of genomes currently analysed, poses always higher algorithmic challenges to align short reads quickly and accurately against a reference sequence. A crucial, additional, requirement is that the data structures used should be light. The available modern solutions usually are a compromise between the mentioned constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. METHODS: In this work we describe a technique that permits to attain the advantages granted by both classes of indexes. This is achieved using Hamming-aware hash functions--hash functions designed to search the entire Hamming sphere in reduced time--which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space introducing only a logarithmic slowdown (in the query length) for the lookup operation. We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to be un-compressible. RESULTS: The new data structure introduced in this work is called dB-hash and we show how its implementation--BW-ERNE--maintains the high sensitivity and speed of its (hash-based) predecessor ERNE, while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data, show, finally, that BW-ERNE is able to attain both the positive features of succinct data structures (that is, small space) and hash indexes (that is, sensitivity). CONCLUSIONS: In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughputs and memory footprints. In this work we show that, combining hashing and succinct indexing techniques, we can attain good performances and accuracy with a memory footprint comparable to that of the most popular compressed indexes.
format	Online Article Text
id	pubmed-4464037
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-44640372015-06-29 Fast randomized approximate string matching with succinct hash data structures Policriti, Alberto Prezza, Nicola BMC Bioinformatics Research BACKGROUND: The high throughput of modern NGS sequencers coupled with the huge sizes of genomes currently analysed, poses always higher algorithmic challenges to align short reads quickly and accurately against a reference sequence. A crucial, additional, requirement is that the data structures used should be light. The available modern solutions usually are a compromise between the mentioned constraints: in particular, indexes based on the Burrows-Wheeler transform offer reduced memory requirements at the price of lower sensitivity, while hash-based text indexes guarantee high sensitivity at the price of significant memory consumption. METHODS: In this work we describe a technique that permits to attain the advantages granted by both classes of indexes. This is achieved using Hamming-aware hash functions--hash functions designed to search the entire Hamming sphere in reduced time--which are also homomorphisms on de Bruijn graphs. We show that, using this particular class of hash functions, the corresponding hash index can be represented in linear space introducing only a logarithmic slowdown (in the query length) for the lookup operation. We point out that our data structure reaches its goals without compressing its input: another positive feature, as in biological applications data is often very close to be un-compressible. RESULTS: The new data structure introduced in this work is called dB-hash and we show how its implementation--BW-ERNE--maintains the high sensitivity and speed of its (hash-based) predecessor ERNE, while drastically reducing space consumption. Extensive comparison experiments conducted with several popular alignment tools on both simulated and real NGS data, show, finally, that BW-ERNE is able to attain both the positive features of succinct data structures (that is, small space) and hash indexes (that is, sensitivity). CONCLUSIONS: In applications where space and speed are both a concern, standard methods often sacrifice accuracy to obtain competitive throughputs and memory footprints. In this work we show that, combining hashing and succinct indexing techniques, we can attain good performances and accuracy with a memory footprint comparable to that of the most popular compressed indexes. BioMed Central 2015-06-01 /pmc/articles/PMC4464037/ /pubmed/26051265 http://dx.doi.org/10.1186/1471-2105-16-S9-S4 Text en Copyright © 2015 Policriti and Prezza et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Policriti, Alberto Prezza, Nicola Fast randomized approximate string matching with succinct hash data structures
title	Fast randomized approximate string matching with succinct hash data structures
title_full	Fast randomized approximate string matching with succinct hash data structures
title_fullStr	Fast randomized approximate string matching with succinct hash data structures
title_full_unstemmed	Fast randomized approximate string matching with succinct hash data structures
title_short	Fast randomized approximate string matching with succinct hash data structures
title_sort	fast randomized approximate string matching with succinct hash data structures
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4464037/ https://www.ncbi.nlm.nih.gov/pubmed/26051265 http://dx.doi.org/10.1186/1471-2105-16-S9-S4
work_keys_str_mv	AT policritialberto fastrandomizedapproximatestringmatchingwithsuccincthashdatastructures AT prezzanicola fastrandomizedapproximatestringmatchingwithsuccincthashdatastructures

Fast randomized approximate string matching with succinct hash data structures

Ejemplares similares