
Second-generation PLINK: rising to the challenge of larger and richer datasets

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementation...


Bibliographic Details
Main Authors: Chang, Christopher C; Chow, Carson C; Tellier, Laurent CAM; Vattikuti, Shashaank; Purcell, Shaun M; Lee, James J
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects: Technical Note
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342193/
https://www.ncbi.nlm.nih.gov/pubmed/25722852
http://dx.doi.org/10.1186/s13742-015-0047-8
author Chang, Christopher C
Chow, Carson C
Tellier, Laurent CAM
Vattikuti, Shashaank
Purcell, Shaun M
Lee, James J
author_sort Chang, Christopher C
collection PubMed
description BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, $O(\sqrt{n})$-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13742-015-0047-8) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4342193
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4342193 2015-02-27 Second-generation PLINK: rising to the challenge of larger and richer datasets Chang, Christopher C Chow, Carson C Tellier, Laurent CAM Vattikuti, Shashaank Purcell, Shaun M Lee, James J Gigascience Technical Note BioMed Central 2015-02-25 /pmc/articles/PMC4342193/ /pubmed/25722852 http://dx.doi.org/10.1186/s13742-015-0047-8 Text en © Chang et al.; licensee BioMed Central. 2015 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Second-generation PLINK: rising to the challenge of larger and richer datasets
topic Technical Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342193/
https://www.ncbi.nlm.nih.gov/pubmed/25722852
http://dx.doi.org/10.1186/s13742-015-0047-8