Cargando…
Fast lossless compression via cascading Bloom filters
BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to ke...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2014
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168706/ https://www.ncbi.nlm.nih.gov/pubmed/25252952 http://dx.doi.org/10.1186/1471-2105-15-S9-S7 |
_version_ | 1782335602982649856 |
---|---|
author | Rozov, Roye Shamir, Ron Halperin, Eran |
author_facet | Rozov, Roye Shamir, Ron Halperin, Eran |
author_sort | Rozov, Roye |
collection | PubMed |
description | BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes. RESULTS: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters. CONCLUSIONS: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly. |
format | Online Article Text |
id | pubmed-4168706 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-41687062014-10-02 Fast lossless compression via cascading Bloom filters Rozov, Roye Shamir, Ron Halperin, Eran BMC Bioinformatics Proceedings BACKGROUND: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes. RESULTS: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters. CONCLUSIONS: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly. BioMed Central 2014-09-10 /pmc/articles/PMC4168706/ /pubmed/25252952 http://dx.doi.org/10.1186/1471-2105-15-S9-S7 Text en Copyright © 2014 Rozov et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Proceedings Rozov, Roye Shamir, Ron Halperin, Eran Fast lossless compression via cascading Bloom filters |
title | Fast lossless compression via cascading Bloom filters |
title_full | Fast lossless compression via cascading Bloom filters |
title_fullStr | Fast lossless compression via cascading Bloom filters |
title_full_unstemmed | Fast lossless compression via cascading Bloom filters |
title_short | Fast lossless compression via cascading Bloom filters |
title_sort | fast lossless compression via cascading bloom filters |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168706/ https://www.ncbi.nlm.nih.gov/pubmed/25252952 http://dx.doi.org/10.1186/1471-2105-15-S9-S7 |
work_keys_str_mv | AT rozovroye fastlosslesscompressionviacascadingbloomfilters AT shamirron fastlosslesscompressionviacascadingbloomfilters AT halperineran fastlosslesscompressionviacascadingbloomfilters |