Cargando…

Fast lossless compression via cascading Bloom filters

BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to ke...

Descripción completa

Detalles Bibliográficos
Autores principales: Rozov, Roye, Shamir, Ron, Halperin, Eran
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2014
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168706/
https://www.ncbi.nlm.nih.gov/pubmed/25252952
http://dx.doi.org/10.1186/1471-2105-15-S9-S7
_version_ 1782335602982649856
author Rozov, Roye
Shamir, Ron
Halperin, Eran
author_facet Rozov, Roye
Shamir, Ron
Halperin, Eran
author_sort Rozov, Roye
collection PubMed
description BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes. RESULTS: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters. CONCLUSIONS: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
format Online
Article
Text
id pubmed-4168706
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-41687062014-10-02 Fast lossless compression via cascading Bloom filters Rozov, Roye Shamir, Ron Halperin, Eran BMC Bioinformatics Proceedings BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes. RESULTS: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters. CONCLUSIONS: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly. BioMed Central 2014-09-10 /pmc/articles/PMC4168706/ /pubmed/25252952 http://dx.doi.org/10.1186/1471-2105-15-S9-S7 Text en Copyright © 2014 Rozov et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Proceedings
Rozov, Roye
Shamir, Ron
Halperin, Eran
Fast lossless compression via cascading Bloom filters
title Fast lossless compression via cascading Bloom filters
title_full Fast lossless compression via cascading Bloom filters
title_fullStr Fast lossless compression via cascading Bloom filters
title_full_unstemmed Fast lossless compression via cascading Bloom filters
title_short Fast lossless compression via cascading Bloom filters
title_sort fast lossless compression via cascading bloom filters
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168706/
https://www.ncbi.nlm.nih.gov/pubmed/25252952
http://dx.doi.org/10.1186/1471-2105-15-S9-S7
work_keys_str_mv AT rozovroye fastlosslesscompressionviacascadingbloomfilters
AT shamirron fastlosslesscompressionviacascadingbloomfilters
AT halperineran fastlosslesscompressionviacascadingbloomfilters