A comparison study of succinct data structures for use in GWAS
BACKGROUND: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al (NRG 11:647–657, 2010) offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly impor...
Main authors: | Putnam, Patrick P, Zhang, Ge, Wilsey, Philip A |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879196/ https://www.ncbi.nlm.nih.gov/pubmed/24359123 http://dx.doi.org/10.1186/1471-2105-14-369 |
_version_ | 1782297934017069056 |
---|---|
author | Putnam, Patrick P Zhang, Ge Wilsey, Philip A |
author_facet | Putnam, Patrick P Zhang, Ge Wilsey, Philip A |
author_sort | Putnam, Patrick P |
collection | PubMed |
description | BACKGROUND: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al. (NRG 11:647–657, 2010) suggested that, with data sets approaching the petabyte scale, data-related challenges such as formatting, management, and transfer are increasingly important topics that need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes. RESULTS: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms. We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes. However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient. CONCLUSIONS: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS). We implemented a 2-bit encoding for genotype data and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory and is slightly more efficient in some algorithms than the 3-bit encoding. |
format | Online Article Text |
id | pubmed-3879196 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-38791962014-01-03 A comparison study of succinct data structures for use in GWAS Putnam, Patrick P Zhang, Ge Wilsey, Philip A BMC Bioinformatics Software BACKGROUND: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al. (NRG 11:647–657, 2010) suggested that, with data sets approaching the petabyte scale, data-related challenges such as formatting, management, and transfer are increasingly important topics that need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes. RESULTS: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms. We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes. However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient. CONCLUSIONS: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS). We implemented a 2-bit encoding for genotype data and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory and is slightly more efficient in some algorithms than the 3-bit encoding. BioMed Central 2013-12-21 /pmc/articles/PMC3879196/ /pubmed/24359123 http://dx.doi.org/10.1186/1471-2105-14-369 Text en Copyright © 2013 Putnam et al.; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Software Putnam, Patrick P Zhang, Ge Wilsey, Philip A A comparison study of succinct data structures for use in GWAS |
title | A comparison study of succinct data structures for use in GWAS |
title_full | A comparison study of succinct data structures for use in GWAS |
title_fullStr | A comparison study of succinct data structures for use in GWAS |
title_full_unstemmed | A comparison study of succinct data structures for use in GWAS |
title_short | A comparison study of succinct data structures for use in GWAS |
title_sort | comparison study of succinct data structures for use in gwas |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3879196/ https://www.ncbi.nlm.nih.gov/pubmed/24359123 http://dx.doi.org/10.1186/1471-2105-14-369 |
work_keys_str_mv | AT putnampatrickp acomparisonstudyofsuccinctdatastructuresforuseingwas AT zhangge acomparisonstudyofsuccinctdatastructuresforuseingwas AT wilseyphilipa acomparisonstudyofsuccinctdatastructuresforuseingwas AT putnampatrickp comparisonstudyofsuccinctdatastructuresforuseingwas AT zhangge comparisonstudyofsuccinctdatastructuresforuseingwas AT wilseyphilipa comparisonstudyofsuccinctdatastructuresforuseingwas |
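
The abstract in this record contrasts 2-bit and 3-bit genotype encodings for counting algorithms. The Python sketch below illustrates one plausible reading of that contrast and is not taken from libgwaspp. It assumes that the 2-bit scheme packs one code per genotype (0, 1, 2 for the three genotype classes, 3 for missing) into 64-bit words, and that the 3-bit scheme keeps one bit-vector per genotype class with a missing call setting no bit. All function names and the code assignment are hypothetical.

```python
from typing import List

LOW_BITS = 0x5555555555555555  # low bit of each 2-bit slot in a 64-bit word


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_2bit(genotypes: List[int]) -> List[int]:
    """Pack codes 0..3 into 64-bit words, 32 genotypes per word (assumed layout)."""
    words = []
    for start in range(0, len(genotypes), 32):
        word = 0
        for offset, g in enumerate(genotypes[start:start + 32]):
            if not 0 <= g <= 3:
                raise ValueError(f"genotype code out of range: {g}")
            word |= g << (2 * offset)
        words.append(word)
    return words


def count_2bit(words: List[int], n: int) -> List[int]:
    """Count each code; extra masking is needed to separate the two bits."""
    counts = [0, 0, 0, 0]
    for w in words:
        lo = w & LOW_BITS          # low bit of every slot
        hi = (w >> 1) & LOW_BITS   # high bit of every slot, shifted down
        counts[1] += popcount(lo & ~hi)  # code 01
        counts[2] += popcount(hi & ~lo)  # code 10
        counts[3] += popcount(hi & lo)   # code 11 (missing)
    # Padding slots in the last word hold code 0, so derive count 0 from n.
    counts[0] = n - sum(counts[1:])
    return counts


def pack_3bit(genotypes: List[int]) -> List[int]:
    """One bit-vector per genotype class; a missing call (3) sets no bit."""
    planes = [0, 0, 0]
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        if g < 3:
            planes[g] |= 1 << i
    return planes


def count_3bit(planes: List[int], n: int) -> List[int]:
    """Each class count is a single popcount of its bit-vector."""
    counts = [popcount(p) for p in planes]
    return counts + [n - sum(counts)]


if __name__ == "__main__":
    sample = [0, 1, 2, 3, 1, 1, 0, 2] * 5  # 40 genotypes, spans two words
    print(count_2bit(pack_2bit(sample), len(sample)))
    print(count_3bit(pack_3bit(sample), len(sample)))
```

Under these assumptions the 2-bit form stores each genotype in two bits instead of three, but a simple frequency count needs additional shifts and masks per word, whereas the 3-bit form needs only one population count per class. That trade-off is consistent in direction with the overhead the abstract reports for simple frequency tables, though the sketch makes no claim about the measured figures.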