
Second-generation PLINK: rising to the challenge of larger and richer datasets

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementation...


Bibliographic Details
Main authors:	Chang, Christopher C; Chow, Carson C; Tellier, Laurent CAM; Vattikuti, Shashaank; Purcell, Shaun M; Lee, James J
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Technical Note
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342193/ https://www.ncbi.nlm.nih.gov/pubmed/25722852 http://dx.doi.org/10.1186/s13742-015-0047-8

author	Chang, Christopher C Chow, Carson C Tellier, Laurent CAM Vattikuti, Shashaank Purcell, Shaun M Lee, James J
author_sort	Chang, Christopher C
collection	PubMed
description	BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, [Formula: see text] -time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13742-015-0047-8) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4342193
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4342193 2015-02-27 Second-generation PLINK: rising to the challenge of larger and richer datasets Chang, Christopher C Chow, Carson C Tellier, Laurent CAM Vattikuti, Shashaank Purcell, Shaun M Lee, James J Gigascience Technical Note BioMed Central 2015-02-25 /pmc/articles/PMC4342193/ /pubmed/25722852 http://dx.doi.org/10.1186/s13742-015-0047-8 Text en © Chang et al.; licensee BioMed Central. 2015 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Second-generation PLINK: rising to the challenge of larger and richer datasets
topic	Technical Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342193/ https://www.ncbi.nlm.nih.gov/pubmed/25722852 http://dx.doi.org/10.1186/s13742-015-0047-8
