
Data-dependent bucketing improves reference-free compression of sequencing reads

Motivation: The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data. Result...


Bibliographic Details
Main Authors: Patro, Rob; Kingsford, Carl
Format: Online Article Text
Language: English
Published: Oxford University Press 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547610/
https://www.ncbi.nlm.nih.gov/pubmed/25910696
http://dx.doi.org/10.1093/bioinformatics/btv248
author Patro, Rob
Kingsford, Carl
collection PubMed
description Motivation: The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data. Results: We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes. Availability and implementation: Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince. Contact: carlk@cs.cmu.edu Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-4547610
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4547610 2015-08-25 Data-dependent bucketing improves reference-free compression of sequencing reads Patro, Rob Kingsford, Carl Bioinformatics Original Papers Oxford University Press 2015-09-01 2015-04-24 /pmc/articles/PMC4547610/ /pubmed/25910696 http://dx.doi.org/10.1093/bioinformatics/btv248 Text en © The Author 2015. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Data-dependent bucketing improves reference-free compression of sequencing reads
topic Original Papers