Cargando…

Fast lossless compression via cascading Bloom filters

BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to ke...

Descripción completa

Detalles Bibliográficos
Autores principales:	Rozov, Roye, Shamir, Ron, Halperin, Eran
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2014
Materias:	Proceedings
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168706/ https://www.ncbi.nlm.nih.gov/pubmed/25252952 http://dx.doi.org/10.1186/1471-2105-15-S9-S7

_version_	1782335602982649856
author	Rozov, Roye Shamir, Ron Halperin, Eran
author_facet	Rozov, Roye Shamir, Ron Halperin, Eran
author_sort	Rozov, Roye
collection	PubMed
description	BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes. RESULTS: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters. CONCLUSIONS: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
format	Online Article Text
id	pubmed-4168706
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-41687062014-10-02 Fast lossless compression via cascading Bloom filters Rozov, Roye Shamir, Ron Halperin, Eran BMC Bioinformatics Proceedings BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes. RESULTS: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters. CONCLUSIONS: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly. BioMed Central 2014-09-10 /pmc/articles/PMC4168706/ /pubmed/25252952 http://dx.doi.org/10.1186/1471-2105-15-S9-S7 Text en Copyright © 2014 Rozov et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Proceedings Rozov, Roye Shamir, Ron Halperin, Eran Fast lossless compression via cascading Bloom filters
title	Fast lossless compression via cascading Bloom filters
title_full	Fast lossless compression via cascading Bloom filters
title_fullStr	Fast lossless compression via cascading Bloom filters
title_full_unstemmed	Fast lossless compression via cascading Bloom filters
title_short	Fast lossless compression via cascading Bloom filters
title_sort	fast lossless compression via cascading bloom filters
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168706/ https://www.ncbi.nlm.nih.gov/pubmed/25252952 http://dx.doi.org/10.1186/1471-2105-15-S9-S7
work_keys_str_mv	AT rozovroye fastlosslesscompressionviacascadingbloomfilters AT shamirron fastlosslesscompressionviacascadingbloomfilters AT halperineran fastlosslesscompressionviacascadingbloomfilters

Fast lossless compression via cascading Bloom filters

Ejemplares similares