Fast Ordered Sampling of DNA Sequence Variants
Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-...
Main Author: | Greenberg, Anthony J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Genetics Society of America 2018 |
Subjects: | Software and Data Resources |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5940139/ https://www.ncbi.nlm.nih.gov/pubmed/29531124 http://dx.doi.org/10.1534/g3.117.300465 |
_version_ | 1783321054459985920 |
---|---|
author | Greenberg, Anthony J. |
author_facet | Greenberg, Anthony J. |
author_sort | Greenberg, Anthony J. |
collection | PubMed |
description | Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects. |
format | Online Article Text |
id | pubmed-5940139 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Genetics Society of America |
record_format | MEDLINE/PubMed |
spelling | pubmed-5940139 2018-05-10 Fast Ordered Sampling of DNA Sequence Variants Greenberg, Anthony J. G3 (Bethesda) Software and Data Resources Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects. Genetics Society of America 2018-03-12 /pmc/articles/PMC5940139/ /pubmed/29531124 http://dx.doi.org/10.1534/g3.117.300465 Text en Copyright © 2018 Greenberg http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Software and Data Resources Greenberg, Anthony J. Fast Ordered Sampling of DNA Sequence Variants |
title | Fast Ordered Sampling of DNA Sequence Variants |
title_full | Fast Ordered Sampling of DNA Sequence Variants |
title_fullStr | Fast Ordered Sampling of DNA Sequence Variants |
title_full_unstemmed | Fast Ordered Sampling of DNA Sequence Variants |
title_short | Fast Ordered Sampling of DNA Sequence Variants |
title_sort | fast ordered sampling of dna sequence variants |
topic | Software and Data Resources |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5940139/ https://www.ncbi.nlm.nih.gov/pubmed/29531124 http://dx.doi.org/10.1534/g3.117.300465 |
work_keys_str_mv | AT greenberganthonyj fastorderedsamplingofdnasequencevariants |