
PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering


Bibliographic Details
Main authors: Sun, Hui; Zheng, Yingfeng; Xie, Haonan; Ma, Huidong; Liu, Xiaoguang; Wang, Gang
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691058/
https://www.ncbi.nlm.nih.gov/pubmed/38036969
http://dx.doi.org/10.1186/s12859-023-05566-9
_version_ 1785152659598082048
author Sun, Hui
Zheng, Yingfeng
Xie, Haonan
Ma, Huidong
Liu, Xiaoguang
Wang, Gang
collection PubMed
description BACKGROUND: Genomic sequencing read compressors are essential for balancing the generation speed of high-throughput short reads, large-scale genomic data sharing, and infrastructure storage costs. However, most existing short-read compressors rarely exploit big-memory systems or the redundant information shared between different sequencing files to achieve a higher compression ratio and conserve storage space. RESULTS: We use compression ratio as the optimization objective and propose PMFFRC, a large-scale genomic short-read compression optimizer based on novel memory modeling and redundant-read clustering techniques. When cascaded with PMFFRC on 982 GB of FASTQ-format sequencing data (with 274 GB and 3.3 billion short reads), the state-of-the-art reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve average maximum compression ratio gains of 77.89%, 77.56%, 73.51%, and 29.36%, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space compared with the four unoptimized compressors. CONCLUSIONS: PMFFRC makes rational use of the compression server's large memory, effectively reducing the storage space required for sequencing reads, which lowers basic storage infrastructure costs and the transmission overhead of community sharing. Our work provides a novel solution for improving sequencing read compression and saving storage space. The proposed PMFFRC algorithm is packaged in a Linux toolkit of the same name, available without restriction at https://github.com/fahaihi/PMFFRC. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05566-9.
format Online
Article
Text
id pubmed-10691058
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10691058 2023-12-02 BMC Bioinformatics Software
BioMed Central 2023-11-30 /pmc/articles/PMC10691058/ /pubmed/38036969 http://dx.doi.org/10.1186/s12859-023-05566-9 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering
topic Software