
A comparison study of succinct data structures for use in GWAS

Bibliographic Details
Main authors: Putnam, Patrick P; Zhang, Ge; Wilsey, Philip A
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879196/
https://www.ncbi.nlm.nih.gov/pubmed/24359123
http://dx.doi.org/10.1186/1471-2105-14-369
author Putnam, Patrick P
Zhang, Ge
Wilsey, Philip A
collection PubMed
description BACKGROUND: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al. (NRG 11:647–657, 2010) offered that with data sets approaching the petabyte scale, data-related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes.
RESULTS: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms. We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes. However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient.
CONCLUSIONS: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS). We implemented a 2-bit encoding for genotype data and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory and is slightly more efficient in some algorithms than the 3-bit encoding.
format Online
Article
Text
id pubmed-3879196
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3879196 2014-01-03 A comparison study of succinct data structures for use in GWAS Putnam, Patrick P; Zhang, Ge; Wilsey, Philip A BMC Bioinformatics Software BioMed Central 2013-12-21 /pmc/articles/PMC3879196/ /pubmed/24359123 http://dx.doi.org/10.1186/1471-2105-14-369 Text en Copyright © 2013 Putnam et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A comparison study of succinct data structures for use in GWAS
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879196/
https://www.ncbi.nlm.nih.gov/pubmed/24359123
http://dx.doi.org/10.1186/1471-2105-14-369