Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

BACKGROUND: As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data because of its enormous size. This is and will remain a frequent problem, encountered every day by researcher...

Bibliographic Details
Main Authors:	Qiao, Dandi, Yip, Wai-Ki, Lange, Christoph
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3434015/ https://www.ncbi.nlm.nih.gov/pubmed/22591016 http://dx.doi.org/10.1186/1471-2105-13-100

author	Qiao, Dandi; Yip, Wai-Ki; Lange, Christoph
author_sort	Qiao, Dandi
collection	PubMed
description	BACKGROUND: As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data because of its enormous size. This is and will remain a frequent problem, encountered every day by researchers who work with genetic data. Some options exist for compressing and storing such data, such as general-purpose compression software and the PBAT/PLINK binary format. However, these currently available methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data is accessed. RESULTS: Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundred, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that the algorithm gives a greater compression rate than the commonly used compression methods and that the data-loading process takes less time. The C++ library also provides direct data-retrieval functions, which allow the compressed information to be easily accessed by other C++ programs. CONCLUSIONS: The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.
format	Online Article Text
id	pubmed-3434015
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3434015 2012-09-10 Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data. Qiao, Dandi; Yip, Wai-Ki; Lange, Christoph. BMC Bioinformatics. Methodology Article.
BioMed Central 2012-05-16 /pmc/articles/PMC3434015/ /pubmed/22591016 http://dx.doi.org/10.1186/1471-2105-13-100 Text en Copyright ©2012 Qiao et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data
topic	Methodology Article

Similar Items