QualComp: a new lossy compressor for quality scores based on rate distortion theory

BACKGROUND: Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of dis...

Bibliographic Details
Main Authors: Ochoa, Idoia; Asnani, Himanshu; Bharadia, Dinesh; Chowdhury, Mainak; Weissman, Tsachy; Yona, Golan
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698011/
https://www.ncbi.nlm.nih.gov/pubmed/23758828
http://dx.doi.org/10.1186/1471-2105-14-187
_version_ 1782275223139123200
author Ochoa, Idoia
Asnani, Himanshu
Bharadia, Dinesh
Chowdhury, Mainak
Weissman, Tsachy
Yona, Golan
author_sort Ochoa, Idoia
collection PubMed
description BACKGROUND: Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. RESULTS: In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment). CONCLUSIONS: QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp.
format Online
Article
Text
id pubmed-3698011
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3698011 2013-07-02 QualComp: a new lossy compressor for quality scores based on rate distortion theory Ochoa, Idoia Asnani, Himanshu Bharadia, Dinesh Chowdhury, Mainak Weissman, Tsachy Yona, Golan BMC Bioinformatics Methodology Article BioMed Central 2013-06-08 /pmc/articles/PMC3698011/ /pubmed/23758828 http://dx.doi.org/10.1186/1471-2105-14-187 Text en Copyright ©2013 Ochoa et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title QualComp: a new lossy compressor for quality scores based on rate distortion theory
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698011/
https://www.ncbi.nlm.nih.gov/pubmed/23758828
http://dx.doi.org/10.1186/1471-2105-14-187