Reference-based compression of short-read sequences using path encoding

Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. Results: We present here an approach to compression that reduces the difficulty of managing l...

Bibliographic Details
Main Authors: Kingsford, Carl; Patro, Rob
Format: Online Article Text
Language: English
Published: Oxford University Press 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481695/
https://www.ncbi.nlm.nih.gov/pubmed/25649622
http://dx.doi.org/10.1093/bioinformatics/btv071
author Kingsford, Carl
Patro, Rob
author_facet Kingsford, Carl
Patro, Rob
author_sort Kingsford, Carl
collection PubMed
description Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X. Contact: carlk@cs.cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-4481695
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-44816952015-06-30 Reference-based compression of short-read sequences using path encoding Kingsford, Carl Patro, Rob Bioinformatics Original Papers Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X. Contact: carlk@cs.cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2015-06-15 2015-02-02 /pmc/articles/PMC4481695/ /pubmed/25649622 http://dx.doi.org/10.1093/bioinformatics/btv071 Text en © The Author 2015. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Kingsford, Carl
Patro, Rob
Reference-based compression of short-read sequences using path encoding
title Reference-based compression of short-read sequences using path encoding
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481695/
https://www.ncbi.nlm.nih.gov/pubmed/25649622
http://dx.doi.org/10.1093/bioinformatics/btv071
work_keys_str_mv AT kingsfordcarl referencebasedcompressionofshortreadsequencesusingpathencoding
AT patrorob referencebasedcompressionofshortreadsequencesusingpathencoding
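
Illustrative note on the abstract: the description above links storing paths in de Bruijn graphs to context-dependent arithmetic coding. The following is a minimal sketch of that general idea only, not the authors' implementation (their tool is written in Go and is not reproduced here). It assumes sequences contain only A, C, G and T, charges a flat 2 bits for each of the first k bases, and uses add-one smoothing; the function names `build_context_counts` and `estimated_bits`, the smoothing scheme, and the toy sequences are all hypothetical choices made for illustration. It reports the ideal code length (the sum of -log2 p) that an arithmetic coder driven by such a model would approach, rather than running an actual arithmetic coder.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def build_context_counts(reference_seqs, k):
    """For every k-mer in the reference, count which base follows it.

    Each (k-mer, next base) pair corresponds to an edge leaving a node
    of the reference's de Bruijn graph.
    """
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0))
    for seq in reference_seqs:
        for i in range(len(seq) - k):
            counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def estimated_bits(read, counts, k, pseudocount=1.0):
    """Ideal code length of a read, in bits, under the context model.

    Assumption of this sketch: the first k bases cost 2 bits each.
    Every later base costs -log2 P(base | previous k bases).
    """
    bits = 2.0 * min(k, len(read))
    for i in range(k, len(read)):
        seen = counts.get(read[i - k:i])
        if seen is None:
            # Context absent from the reference: fall back to uniform.
            seen = dict.fromkeys(BASES, 0)
        total = sum(seen.values()) + pseudocount * len(BASES)
        p = (seen[read[i]] + pseudocount) / total
        bits -= math.log2(p)
    return bits


if __name__ == "__main__":
    toy_counts = build_context_counts(["ACGTACGTACGTACGT"], k=4)
    for toy_read in ("ACGTACGT", "TTTTTTTT"):
        print(toy_read, round(estimated_bits(toy_read, toy_counts, k=4), 2))
```

In this toy setting, a read that follows a path present in the reference costs fewer bits than the 2 bits per base of a plain encoding, while a read whose contexts never occur in the reference falls back to exactly 2 bits per base. That is consistent in spirit with the abstract's remark that a poorly matched reference can still be used, although the paper's actual results depend on details not shown in this record.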