
Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity

Background: With the rapid development of high-throughput sequencing technology and the explosive growth of genomic data, storing, transmitting and processing massive amounts of data has become a new challenge. How to achieve fast lossless compression and decompression according to the characteristi...


Bibliographic Details
Main authors: Ding, Youde; Liao, Yuan; He, Ji; Ma, Jianfeng; Wei, Xu; Liu, Xuemei; Zhang, Guiying; Wang, Jing
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267386/
https://www.ncbi.nlm.nih.gov/pubmed/37323665
http://dx.doi.org/10.3389/fgene.2023.1213907
_version_ 1785058915050848256
author Ding, Youde
Liao, Yuan
He, Ji
Ma, Jianfeng
Wei, Xu
Liu, Xuemei
Zhang, Guiying
Wang, Jing
author_facet Ding, Youde
Liao, Yuan
He, Ji
Ma, Jianfeng
Wei, Xu
Liu, Xuemei
Zhang, Guiying
Wang, Jing
author_sort Ding, Youde
collection PubMed
description Background: With the rapid development of high-throughput sequencing technology and the explosive growth of genomic data, storing, transmitting and processing massive amounts of data has become a new challenge. Achieving fast lossless compression and decompression that exploits the characteristics of the data, in order to speed up data transmission and processing, requires research on suitable compression algorithms. Methods: This paper proposes a compression algorithm for sparse asymmetric gene mutations (CA_SAGM) based on the characteristics of sparse genomic mutation data. The data were first sorted on a row-first basis so that neighboring non-zero elements were as close as possible to each other. The data were then renumbered using the reverse Cuthill-McKee sorting technique. Finally, the data were compressed into compressed sparse row (CSR) format and stored. We analyzed and compared the results of the CA_SAGM, coordinate format (COO) and compressed sparse column format (CSC) algorithms for sparse asymmetric genomic data. Nine types of single-nucleotide variation (SNV) data and six types of copy number variation (CNV) data from the TCGA database were used as the subjects of this study. Compression and decompression time, compression and decompression rate, compression memory and compression ratio were used as evaluation metrics. The correlation between each metric and the basic characteristics of the original data was further investigated. Results: The experimental results showed that the COO method had the shortest compression time, the fastest compression rate and the largest compression ratio, and thus the best compression performance. CSC compression performance was the worst, and CA_SAGM compression performance was between the two. When decompressing the data, CA_SAGM performed the best, with the shortest decompression time and the fastest decompression rate. COO decompression performance was the worst. With increasing sparsity, the COO, CSC and CA_SAGM algorithms all exhibited longer compression and decompression times, lower compression and decompression rates, larger compression memory and lower compression ratios. When the sparsity was large, the compression memory and compression ratio of the three algorithms showed no differences, but the remaining metrics still differed. Conclusion: CA_SAGM is an efficient compression algorithm that combines good compression and decompression performance for sparse genomic mutation data.
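The storage step named in the abstract, a row-first scan followed by compressed sparse row (CSR) storage, can be illustrated with a minimal sketch. This is not the authors' CA_SAGM implementation: the function names `to_csr` and `from_csr` and the toy matrix are hypothetical, and the reverse Cuthill-McKee renumbering step is deliberately omitted (a library routine such as SciPy's `scipy.sparse.csgraph.reverse_cuthill_mckee` is one possible way to add it). The sketch only shows that CSR keeps the non-zero values, their column indices and one pointer per row, and that decompression restores the original matrix losslessly.

```python
from typing import List, Tuple

Matrix = List[List[int]]
# (data, indices, indptr, n_cols): the three CSR arrays plus the column count
CSR = Tuple[List[int], List[int], List[int], int]


def to_csr(dense: Matrix) -> CSR:
    """Scan the matrix row by row and keep only the non-zero entries."""
    data: List[int] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    n_cols = len(dense[0]) if dense else 0
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # non-zero value
                indices.append(col)  # its column index
        indptr.append(len(data))     # where the next row starts in `data`
    return data, indices, indptr, n_cols


def from_csr(csr: CSR) -> Matrix:
    """Rebuild the dense matrix from the CSR arrays (lossless)."""
    data, indices, indptr, n_cols = csr
    dense: Matrix = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        dense.append(row)
    return dense


# Hypothetical toy mutation matrix (rows: samples, columns: genes).
toy = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 4],
]
csr = to_csr(toy)
restored = from_csr(csr)
```

For the toy matrix, the 16 dense cells reduce to four stored values, four column indices and five row pointers; the relative saving grows as the proportion of zero entries increases.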
format Online
Article
Text
id pubmed-10267386
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10267386 2023-06-15 Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity Ding, Youde Liao, Yuan He, Ji Ma, Jianfeng Wei, Xu Liu, Xuemei Zhang, Guiying Wang, Jing Front Genet Genetics Frontiers Media S.A. 2023-06-01 /pmc/articles/PMC10267386/ /pubmed/37323665 http://dx.doi.org/10.3389/fgene.2023.1213907 Text en Copyright © 2023 Ding, Liao, He, Ma, Wei, Liu, Zhang and Wang. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Genetics
Ding, Youde
Liao, Yuan
He, Ji
Ma, Jianfeng
Wei, Xu
Liu, Xuemei
Zhang, Guiying
Wang, Jing
Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity
title Enhancing genomic mutation data storage optimization based on the compression of asymmetry of sparsity
topic Genetics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10267386/
https://www.ncbi.nlm.nih.gov/pubmed/37323665
http://dx.doi.org/10.3389/fgene.2023.1213907