
LCQS: an efficient lossless compression tool of quality scores with random access functionality

BACKGROUND: Advanced sequencing machines dramatically speed up the generation of genomic data, which makes the demand for efficient compression of sequencing data extremely urgent and significant. As the most difficult part of the standard sequencing data format FASTQ, compression of the quality scor...

Bibliographic Details
Main Authors: Fu, Jiabing; Ke, Bixin; Dong, Shoubin
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7079445/
https://www.ncbi.nlm.nih.gov/pubmed/32183707
http://dx.doi.org/10.1186/s12859-020-3428-7
_version_ 1783507825300865024
author Fu, Jiabing
Ke, Bixin
Dong, Shoubin
collection PubMed
description BACKGROUND: Advanced sequencing machines dramatically speed up the generation of genomic data, which makes the demand for efficient compression of sequencing data urgent. Quality scores are the most difficult part of the standard sequencing data format FASTQ to compress, and their compression has become a conundrum in the development of FASTQ compression. Existing lossless compressors of quality scores mainly rely on patterns specific to particular sequencers and on complex context modeling techniques to improve the compression ratio. However, these compressors suffer from weak robustness, meaning unstable or even unavailable results on some sequencing files, and from slow compression speed. Meanwhile, some compressors construct a fine-grained index structure to speed up random access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. Therefore, an efficient lossless compressor of quality scores with strong robustness, a high compression ratio, and fast compression and random access decompression is needed. RESULTS: Based on the idea of maximizing the use of hardware resources, this paper proposes LCQS, a lossless compression tool specialized for quality scores. It consists of four sequential processing steps: partitioning, indexing, packing and parallelizing. Experimental results show that LCQS outperforms all the other state-of-the-art compressors on all criteria except for compression speed on the dataset SRR1284073. Furthermore, LCQS is robust on all the test datasets, increasing compression speed by up to 29.1x, reducing file size by up to 28.78%, and increasing random access decompression speed by up to 2.1x. LCQS also exhibits strong scalability: the compression speed increases almost linearly as the size of the input dataset increases. CONCLUSION: Its ability to handle all kinds of quality scores, its superior compression ratio and compression speed, and its fast random access decompression make LCQS an efficient and advanced lossless quality score compressor. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial usage.
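The abstract names four steps (partitioning, indexing, packing and parallelizing) without detailing them. The sketch below only illustrates the general idea of block-wise compression with an index for random access; it is not the LCQS algorithm. The function names `compress_blocks` and `read_line`, the fixed `block_size`, and the use of zlib as the per-block compressor are all assumptions made for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(lines, block_size=4):
    """Hypothetical sketch: partition quality lines into fixed-size blocks,
    compress the blocks in parallel, pack them into one byte string, and
    record an (offset, length) index entry per block."""
    # Partitioning: group consecutive lines into blocks.
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]

    # Parallelizing: compress each block independently (zlib is a stand-in).
    def compress_one(block):
        return zlib.compress("\n".join(block).encode("ascii"))

    with ThreadPoolExecutor() as pool:
        compressed = list(pool.map(compress_one, blocks))

    # Indexing and packing: remember where each block starts in the output.
    index, offset = [], 0
    for chunk in compressed:
        index.append((offset, len(chunk)))
        offset += len(chunk)
    return b"".join(compressed), index


def read_line(packed, index, line_no, block_size=4):
    """Random access: decompress only the block that holds line_no."""
    block_no, within = divmod(line_no, block_size)
    offset, length = index[block_no]
    block = zlib.decompress(packed[offset:offset + length]).decode("ascii")
    return block.split("\n")[within]


# Usage with made-up quality strings (not taken from any real dataset).
quals = ["IIIIHHHG", "FFFFEEDD", "#######!", "IIIIIIII", "ABCDEFGH", "HHHHGGGG"]
packed, index = compress_blocks(quals, block_size=4)
print(read_line(packed, index, 4))  # prints "ABCDEFGH"
```

In this sketch, smaller blocks make random access cheaper but enlarge the index and usually hurt the compression ratio, which is the kind of trade-off the abstract attributes to fine-grained index structures.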
format Online
Article
Text
id pubmed-7079445
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7079445 2020-03-23 LCQS: an efficient lossless compression tool of quality scores with random access functionality Fu, Jiabing Ke, Bixin Dong, Shoubin BMC Bioinformatics Software BioMed Central 2020-03-18 /pmc/articles/PMC7079445/ /pubmed/32183707 http://dx.doi.org/10.1186/s12859-020-3428-7 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title LCQS: an efficient lossless compression tool of quality scores with random access functionality
topic Software