PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering

BACKGROUND: Genomic sequencing reads compressors are essential for balancing high-throughput sequencing short reads generation speed, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short reads compressors rarely utilize big-memory systems and duplica...

Bibliographic Details
Main Authors:	Sun, Hui; Zheng, Yingfeng; Xie, Haonan; Ma, Huidong; Liu, Xiaoguang; Wang, Gang
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2023
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691058/ https://www.ncbi.nlm.nih.gov/pubmed/38036969 http://dx.doi.org/10.1186/s12859-023-05566-9

author	Sun, Hui; Zheng, Yingfeng; Xie, Haonan; Ma, Huidong; Liu, Xiaoguang; Wang, Gang
collection	PubMed
description	BACKGROUND: Genomic sequencing reads compressors are essential for balancing high-throughput sequencing short reads generation speed, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short reads compressors rarely exploit big-memory systems or the redundant information shared between different sequencing files to achieve a higher compression ratio and conserve reads data storage space. RESULTS: We employ compression ratio as the optimization objective and propose a large-scale genomic sequencing short reads data compression optimizer, named PMFFRC, based on novel memory modeling and redundant reads clustering technologies. By cascading PMFFRC, on 982 GB of fastq format sequencing data, with 274 GB and 3.3 billion short reads, the state-of-the-art reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve 77.89%, 77.56%, 73.51%, and 29.36% average maximum compression ratio gains, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space compared with the four unoptimized compressors. CONCLUSIONS: PMFFRC makes rational use of the compression server's big memory, effectively saving sequencing reads data storage space, which relieves basic storage facility costs and the transmission overhead of community sharing. Our work furnishes a novel solution for improving sequencing reads compression and saving storage space. The proposed PMFFRC algorithm is packaged in a same-name Linux toolkit, freely available at https://github.com/fahaihi/PMFFRC. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05566-9.
format	Online Article Text
id	pubmed-10691058
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-10691058 2023-12-02 BMC Bioinformatics
BioMed Central 2023-11-30 /pmc/articles/PMC10691058/ /pubmed/38036969 http://dx.doi.org/10.1186/s12859-023-05566-9 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10691058/ https://www.ncbi.nlm.nih.gov/pubmed/38036969 http://dx.doi.org/10.1186/s12859-023-05566-9
