
MZPAQ: a FASTQ data compression tool

BACKGROUND: Due to the technological progress in Next Generation Sequencing (NGS), the amount of genomic data that is produced daily has seen a tremendous increase. This increase has shifted the bottleneck of genomic projects from sequencing to computation and specifically storing, managing and anal...


Bibliographic Details
Main Authors: El Allali, Achraf; Arshad, Mariam
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547476/
https://www.ncbi.nlm.nih.gov/pubmed/31171931
http://dx.doi.org/10.1186/s13029-019-0073-5
author El Allali, Achraf
Arshad, Mariam
author_facet El Allali, Achraf
Arshad, Mariam
author_sort El Allali, Achraf
collection PubMed
description BACKGROUND: Due to technological progress in Next Generation Sequencing (NGS), the amount of genomic data produced daily has increased tremendously. This increase has shifted the bottleneck of genomic projects from sequencing to computation, specifically to storing, managing and analyzing large amounts of NGS data. Compression tools can reduce the physical storage needed to save large amounts of genomic data as well as the bandwidth used to transfer this data. Recently, DNA sequence compression has gained much attention among researchers. RESULTS: In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of properties unique to DNA sequences to improve the compression rate, and they usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that, in terms of compression ratio, MZPAQ outperforms state-of-the-art tools on all benchmark datasets obtained from a recent survey. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data. CONCLUSIONS: Currently, MZPAQ’s strengths are its higher compression ratio and its compatibility with all major sequencing platforms. MZPAQ is most suitable when the size of compressed data is crucial, such as for long-term storage and data transfer. More efforts will be made in the future to target other aspects such as compression speed and memory utilization.
format Online
Article
Text
id pubmed-6547476
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6547476 2019-06-06 MZPAQ: a FASTQ data compression tool El Allali, Achraf Arshad, Mariam Source Code Biol Med Methodology
BioMed Central 2019-06-03 /pmc/articles/PMC6547476/ /pubmed/31171931 http://dx.doi.org/10.1186/s13029-019-0073-5 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology
El Allali, Achraf
Arshad, Mariam
MZPAQ: a FASTQ data compression tool
title MZPAQ: a FASTQ data compression tool
title_full MZPAQ: a FASTQ data compression tool
title_fullStr MZPAQ: a FASTQ data compression tool
title_full_unstemmed MZPAQ: a FASTQ data compression tool
title_short MZPAQ: a FASTQ data compression tool
title_sort mzpaq: a fastq data compression tool
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547476/
https://www.ncbi.nlm.nih.gov/pubmed/31171931
http://dx.doi.org/10.1186/s13029-019-0073-5