TRCMGene: A two-step referential compression method for the efficient storage of genetic data

BACKGROUND: The massive quantities of genetic data generated by high-throughput sequencing pose challenges to data storage, transmission and analyses. These problems are effectively solved through data compression, in which the size of data storage is reduced and the speed of data transmission is im...

Bibliographic Details
Main authors: Tang, You, Li, Min, Sun, Jing, Zhang, Tao, Zhang, Jicheng, Zheng, Ping
Format: Online Article Text
Language: English
Published: Public Library of Science 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6218042/
https://www.ncbi.nlm.nih.gov/pubmed/30395579
http://dx.doi.org/10.1371/journal.pone.0206521
_version_ 1783368387461644288
author Tang, You
Li, Min
Sun, Jing
Zhang, Tao
Zhang, Jicheng
Zheng, Ping
author_facet Tang, You
Li, Min
Sun, Jing
Zhang, Tao
Zhang, Jicheng
Zheng, Ping
author_sort Tang, You
collection PubMed
description BACKGROUND: The massive quantities of genetic data generated by high-throughput sequencing pose challenges to data storage, transmission and analyses. These problems are effectively solved through data compression, in which the size of data storage is reduced and the speed of data transmission is improved. Several options are available for compressing and storing genetic data. However, most of these options either do not provide sufficient compression rates or require a considerable length of time for decompression and loading. RESULTS: Here, we propose TRCMGene, a lossless genetic data compression method that uses a referential compression scheme. The novel concept of two-step compression method, which builds an index structure using K-means and k-nearest neighbours, is introduced to TRCMGene. Evaluation with several real datasets revealed that the compression factor of TRCMGene ranges from 9 to 21. TRCMGene presents a good balance between compression factor and reading time. On average, the reading time of compressed data is 60% of that of uncompressed data. Thus, TRCMGene not only saves disc space but also saves file access time and speeds up data loading. These effects collectively improve genetic data storage and transmission in the current hardware environment and render system upgrades unnecessary. TRCMGene, user manual and demos could be accessed freely from https://github.com/tangyou79/TRCM. The data mentioned in this manuscript could be downloaded from: https://github.com/tangyou79/TRCM/wiki.
format Online
Article
Text
id pubmed-6218042
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-62180422018-11-19 TRCMGene: A two-step referential compression method for the efficient storage of genetic data Tang, You Li, Min Sun, Jing Zhang, Tao Zhang, Jicheng Zheng, Ping PLoS One Research Article BACKGROUND: The massive quantities of genetic data generated by high-throughput sequencing pose challenges to data storage, transmission and analyses. These problems are effectively solved through data compression, in which the size of data storage is reduced and the speed of data transmission is improved. Several options are available for compressing and storing genetic data. However, most of these options either do not provide sufficient compression rates or require a considerable length of time for decompression and loading. RESULTS: Here, we propose TRCMGene, a lossless genetic data compression method that uses a referential compression scheme. The novel concept of two-step compression method, which builds an index structure using K-means and k-nearest neighbours, is introduced to TRCMGene. Evaluation with several real datasets revealed that the compression factor of TRCMGene ranges from 9 to 21. TRCMGene presents a good balance between compression factor and reading time. On average, the reading time of compressed data is 60% of that of uncompressed data. Thus, TRCMGene not only saves disc space but also saves file access time and speeds up data loading. These effects collectively improve genetic data storage and transmission in the current hardware environment and render system upgrades unnecessary. TRCMGene, user manual and demos could be accessed freely from https://github.com/tangyou79/TRCM. The data mentioned in this manuscript could be downloaded from: https://github.com/tangyou79/TRCM/wiki. 
Public Library of Science 2018-11-05 /pmc/articles/PMC6218042/ /pubmed/30395579 http://dx.doi.org/10.1371/journal.pone.0206521 Text en © 2018 Tang et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Tang, You
Li, Min
Sun, Jing
Zhang, Tao
Zhang, Jicheng
Zheng, Ping
TRCMGene: A two-step referential compression method for the efficient storage of genetic data
title TRCMGene: A two-step referential compression method for the efficient storage of genetic data
title_full TRCMGene: A two-step referential compression method for the efficient storage of genetic data
title_fullStr TRCMGene: A two-step referential compression method for the efficient storage of genetic data
title_full_unstemmed TRCMGene: A two-step referential compression method for the efficient storage of genetic data
title_short TRCMGene: A two-step referential compression method for the efficient storage of genetic data
title_sort trcmgene: a two-step referential compression method for the efficient storage of genetic data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6218042/
https://www.ncbi.nlm.nih.gov/pubmed/30395579
http://dx.doi.org/10.1371/journal.pone.0206521
work_keys_str_mv AT tangyou trcmgeneatwostepreferentialcompressionmethodfortheefficientstorageofgeneticdata
AT limin trcmgeneatwostepreferentialcompressionmethodfortheefficientstorageofgeneticdata
AT sunjing trcmgeneatwostepreferentialcompressionmethodfortheefficientstorageofgeneticdata
AT zhangtao trcmgeneatwostepreferentialcompressionmethodfortheefficientstorageofgeneticdata
AT zhangjicheng trcmgeneatwostepreferentialcompressionmethodfortheefficientstorageofgeneticdata
AT zhengping trcmgeneatwostepreferentialcompressionmethodfortheefficientstorageofgeneticdata