
HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data

With the maturation of genome sequencing technology, huge amounts of sequence reads and assembled genomes are being generated. This explosive growth poses enormous challenges for the storage and transmission of genomic data. FASTA, as one of the main storage formats for genome...


Bibliographic Details
Main Authors: Yao, Haichang; Ji, Yimu; Li, Kui; Liu, Shangdong; He, Jing; Wang, Ruchuan
Format: Online Article Text
Language: English
Published: Hindawi 2019
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6930768/
https://www.ncbi.nlm.nih.gov/pubmed/31915686
http://dx.doi.org/10.1155/2019/3108950
_version_ 1783482969207341056
author Yao, Haichang
Ji, Yimu
Li, Kui
Liu, Shangdong
He, Jing
Wang, Ruchuan
author_facet Yao, Haichang
Ji, Yimu
Li, Kui
Liu, Shangdong
He, Jing
Wang, Ruchuan
author_sort Yao, Haichang
collection PubMed
description With the maturation of genome sequencing technology, huge amounts of sequence reads and assembled genomes are being generated. This explosive growth poses enormous challenges for the storage and transmission of genomic data. FASTA, one of the main storage formats for genome sequences, is widely used in the Gene Bank because it eases sequence analysis and gene research and is easy to read. Many compression methods for FASTA genome sequences have been proposed, but they still have room for improvement: compression ratio and speed are not high or robust enough, and memory consumption is not ideal. It is therefore of great significance to improve the efficiency, robustness, and practicability of genomic data compression, both to further reduce the storage and transmission cost of genomic data and to promote the research and development of genomic technology. In this manuscript, a hybrid referential compression method (HRCM) for FASTA genome sequences is proposed. HRCM is a lossless compression method that can compress a single sequence as well as large collections of sequences. It is implemented in three stages: sequence information extraction, sequence information matching, and sequence information encoding. Extensive experiments were conducted to evaluate the performance of HRCM. The experimental results show that HRCM is superior to the best-known methods in genome batch compression. Moreover, the memory consumption of HRCM is relatively low, so it can be deployed on standard PCs.
format Online
Article
Text
id pubmed-6930768
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-6930768 2020-01-08 HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data Yao, Haichang Ji, Yimu Li, Kui Liu, Shangdong He, Jing Wang, Ruchuan Biomed Res Int Research Article With the maturation of genome sequencing technology, huge amounts of sequence reads and assembled genomes are being generated. This explosive growth poses enormous challenges for the storage and transmission of genomic data. FASTA, one of the main storage formats for genome sequences, is widely used in the Gene Bank because it eases sequence analysis and gene research and is easy to read. Many compression methods for FASTA genome sequences have been proposed, but they still have room for improvement: compression ratio and speed are not high or robust enough, and memory consumption is not ideal. It is therefore of great significance to improve the efficiency, robustness, and practicability of genomic data compression, both to further reduce the storage and transmission cost of genomic data and to promote the research and development of genomic technology. In this manuscript, a hybrid referential compression method (HRCM) for FASTA genome sequences is proposed. HRCM is a lossless compression method that can compress a single sequence as well as large collections of sequences. It is implemented in three stages: sequence information extraction, sequence information matching, and sequence information encoding. Extensive experiments were conducted to evaluate the performance of HRCM. The experimental results show that HRCM is superior to the best-known methods in genome batch compression. Moreover, the memory consumption of HRCM is relatively low, so it can be deployed on standard PCs. Hindawi 2019-11-16 /pmc/articles/PMC6930768/ /pubmed/31915686 http://dx.doi.org/10.1155/2019/3108950 Text en Copyright © 2019 Haichang Yao et al.
http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Yao, Haichang
Ji, Yimu
Li, Kui
Liu, Shangdong
He, Jing
Wang, Ruchuan
HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data
title HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data
title_full HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data
title_fullStr HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data
title_full_unstemmed HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data
title_short HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data
title_sort hrcm: an efficient hybrid referential compression method for genomic big data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6930768/
https://www.ncbi.nlm.nih.gov/pubmed/31915686
http://dx.doi.org/10.1155/2019/3108950
work_keys_str_mv AT yaohaichang hrcmanefficienthybridreferentialcompressionmethodforgenomicbigdata
AT jiyimu hrcmanefficienthybridreferentialcompressionmethodforgenomicbigdata
AT likui hrcmanefficienthybridreferentialcompressionmethodforgenomicbigdata
AT liushangdong hrcmanefficienthybridreferentialcompressionmethodforgenomicbigdata
AT hejing hrcmanefficienthybridreferentialcompressionmethodforgenomicbigdata
AT wangruchuan hrcmanefficienthybridreferentialcompressionmethodforgenomicbigdata