PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering
BACKGROUND: Genomic sequencing reads compressors are essential for balancing high-throughput sequencing short reads generation speed, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short reads compressors rarely utilize big-memory systems and duplica...
Main authors: | Sun, Hui; Zheng, Yingfeng; Xie, Haonan; Ma, Huidong; Liu, Xiaoguang; Wang, Gang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2023 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691058/ https://www.ncbi.nlm.nih.gov/pubmed/38036969 http://dx.doi.org/10.1186/s12859-023-05566-9 |
_version_ | 1785152659598082048 |
---|---|
author | Sun, Hui Zheng, Yingfeng Xie, Haonan Ma, Huidong Liu, Xiaoguang Wang, Gang |
author_facet | Sun, Hui Zheng, Yingfeng Xie, Haonan Ma, Huidong Liu, Xiaoguang Wang, Gang |
author_sort | Sun, Hui |
collection | PubMed |
description | BACKGROUND: Genomic sequencing reads compressors are essential for balancing high-throughput sequencing short reads generation speed, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short reads compressors rarely exploit big-memory systems or the duplicated information between different sequencing files to achieve a higher compression ratio and conserve reads data storage space. RESULTS: We employ compression ratio as the optimization objective and propose a large-scale genomic sequencing short reads data compression optimizer, named PMFFRC, based on novel memory modeling and redundant reads clustering technologies. When cascaded with PMFFRC, on 982 GB of fastq-format sequencing data, with 274 GB and 3.3 billion short reads, the state-of-the-art reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve 77.89%, 77.56%, 73.51%, and 29.36% average maximum compression ratio gains, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space compared with the four unoptimized compressors. CONCLUSIONS: PMFFRC makes rational use of the compression server's big memory, effectively saving sequencing reads data storage space, which relieves basic storage facility costs and the transmission overhead of community sharing. Our work furnishes a novel solution for improving sequencing reads compression and saving storage space. The proposed PMFFRC algorithm is packaged in a Linux toolkit of the same name, available without restriction at https://github.com/fahaihi/PMFFRC. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05566-9. |
format | Online Article Text |
id | pubmed-10691058 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10691058 2023-12-02 PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering Sun, Hui Zheng, Yingfeng Xie, Haonan Ma, Huidong Liu, Xiaoguang Wang, Gang BMC Bioinformatics Software BioMed Central 2023-11-30 /pmc/articles/PMC10691058/ /pubmed/38036969 http://dx.doi.org/10.1186/s12859-023-05566-9 Text en © The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691058/ https://www.ncbi.nlm.nih.gov/pubmed/38036969 http://dx.doi.org/10.1186/s12859-023-05566-9 |