
Fast Ordered Sampling of DNA Sequence Variants

Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-...


Bibliographic Details
Main Author: Greenberg, Anthony J.
Format: Online Article Text
Language: English
Published: Genetics Society of America 2018
Subjects: Software and Data Resources
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5940139/
https://www.ncbi.nlm.nih.gov/pubmed/29531124
http://dx.doi.org/10.1534/g3.117.300465
_version_ 1783321054459985920
author Greenberg, Anthony J.
author_facet Greenberg, Anthony J.
author_sort Greenberg, Anthony J.
collection PubMed
description Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects.
format Online
Article
Text
id pubmed-5940139
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Genetics Society of America
record_format MEDLINE/PubMed
spelling pubmed-59401392018-05-10 Fast Ordered Sampling of DNA Sequence Variants Greenberg, Anthony J. G3 (Bethesda) Software and Data Resources Explosive growth in the amount of genomic data is matched by increasing power of consumer-grade computers. Even applications that require powerful servers can be quickly tested on desktop or laptop machines if we can generate representative samples from large data sets. I describe a fast and memory-efficient implementation of an on-line sampling method developed for tape drives 30 years ago. Focusing on genotype files, I test the performance of this technique on modern solid-state and spinning hard drives, and show that it performs well compared to a simple sampling scheme. I illustrate its utility by developing a method to quickly estimate genome-wide patterns of linkage disequilibrium (LD) decay with distance. I provide open-source software that samples loci from several variant format files, a separate program that performs LD decay estimates, and a C++ library that lets developers incorporate these methods into their own projects. Genetics Society of America 2018-03-12 /pmc/articles/PMC5940139/ /pubmed/29531124 http://dx.doi.org/10.1534/g3.117.300465 Text en Copyright © 2018 Greenberg http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Software and Data Resources
Greenberg, Anthony J.
Fast Ordered Sampling of DNA Sequence Variants
title Fast Ordered Sampling of DNA Sequence Variants
title_sort fast ordered sampling of dna sequence variants
topic Software and Data Resources
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5940139/
https://www.ncbi.nlm.nih.gov/pubmed/29531124
http://dx.doi.org/10.1534/g3.117.300465
work_keys_str_mv AT greenberganthonyj fastorderedsamplingofdnasequencevariants