MZPAQ: a FASTQ data compression tool
BACKGROUND: Due to the technological progress in Next Generation Sequencing (NGS), the amount of genomic data that is produced daily has seen a tremendous increase. This increase has shifted the bottleneck of genomic projects from sequencing to computation and specifically storing, managing and anal...
Main Authors: | El Allali, Achraf; Arshad, Mariam |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | Methodology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547476/ https://www.ncbi.nlm.nih.gov/pubmed/31171931 http://dx.doi.org/10.1186/s13029-019-0073-5 |
_version_ | 1783423684992565248 |
---|---|
author | El Allali, Achraf Arshad, Mariam |
author_facet | El Allali, Achraf Arshad, Mariam |
author_sort | El Allali, Achraf |
collection | PubMed |
description | BACKGROUND: Due to the technological progress in Next Generation Sequencing (NGS), the amount of genomic data that is produced daily has seen a tremendous increase. This increase has shifted the bottleneck of genomic projects from sequencing to computation and specifically storing, managing and analyzing the large amount of NGS data. Compression tools can reduce the physical storage used to save large amount of genomic data as well as the bandwidth used to transfer this data. Recently, DNA sequence compression has gained much attention among researchers. RESULTS: In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of some properties that are unique to DNA sequences in order to improve the compression rate, and usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that MZPAQ outperforms state-of-the-art tools on all benchmark datasets obtained from a recent survey in terms of compression ratio. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data. CONCLUSIONS: Currently, MZPAQ’s strength is its higher compression ratio as well as its compatibility with all major sequencing platforms. MZPAQ is more suitable when the size of compressed data is crucial, such as long-term storage and data transfer. More efforts will be made in the future to target other aspects such as compression speed and memory utilization. |
format | Online Article Text |
id | pubmed-6547476 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6547476 2019-06-06 MZPAQ: a FASTQ data compression tool El Allali, Achraf Arshad, Mariam Source Code Biol Med Methodology
BioMed Central 2019-06-03 /pmc/articles/PMC6547476/ /pubmed/31171931 http://dx.doi.org/10.1186/s13029-019-0073-5 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology El Allali, Achraf Arshad, Mariam MZPAQ: a FASTQ data compression tool |
title | MZPAQ: a FASTQ data compression tool |
title_sort | mzpaq: a fastq data compression tool |
topic | Methodology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547476/ https://www.ncbi.nlm.nih.gov/pubmed/31171931 http://dx.doi.org/10.1186/s13029-019-0073-5 |
work_keys_str_mv | AT elallaliachraf mzpaqafastqdatacompressiontool AT arshadmariam mzpaqafastqdatacompressiontool |