Second-generation PLINK: rising to the challenge of larger and richer datasets
BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementation...
Main authors: | Chang, Christopher C; Chow, Carson C; Tellier, Laurent CAM; Vattikuti, Shashaank; Purcell, Shaun M; Lee, James J |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342193/ https://www.ncbi.nlm.nih.gov/pubmed/25722852 http://dx.doi.org/10.1186/s13742-015-0047-8 |
_version_ | 1782359249990451200 |
---|---|
author | Chang, Christopher C; Chow, Carson C; Tellier, Laurent CAM; Vattikuti, Shashaank; Purcell, Shaun M; Lee, James J |
collection | PubMed |
description | BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, [Formula: see text] -time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13742-015-0047-8) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4342193 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4342193 2015-02-27 Second-generation PLINK: rising to the challenge of larger and richer datasets Gigascience Technical Note BioMed Central 2015-02-25 /pmc/articles/PMC4342193/ /pubmed/25722852 http://dx.doi.org/10.1186/s13742-015-0047-8 Text en © Chang et al.; licensee BioMed Central. 2015 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Second-generation PLINK: rising to the challenge of larger and richer datasets |
topic | Technical Note |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4342193/ https://www.ncbi.nlm.nih.gov/pubmed/25722852 http://dx.doi.org/10.1186/s13742-015-0047-8 |
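
The description above credits much of the PLINK 1.9 speedup to bit-level parallelism. As a rough illustration of the general idea only, the sketch below packs genotypes at 2 bits per sample and counts alternate alleles with two mask-and-popcount operations instead of a per-sample loop. The encoding used here (the 2-bit code equals the alternate-allele count, with no missing-data code) and the function names `pack_genotypes`, `popcount`, and `alt_allele_count` are assumptions made for this example; they are not PLINK's on-disk format or its actual implementation.

```python
def pack_genotypes(genotypes):
    """Pack alt-allele counts (0, 1, or 2) into one integer, 2 bits per sample.

    Hypothetical encoding for illustration: 00 = 0 copies, 01 = 1 copy,
    10 = 2 copies. This is NOT the PLINK .bed encoding.
    """
    word = 0
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1, or 2")
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def alt_allele_count(word, n_samples):
    """Count alternate alleles across all packed samples at once.

    The low bit of each 2-bit field contributes 1 allele and the high bit
    contributes 2, so two masked popcounts cover every sample in the word.
    """
    low_mask = int("01" * n_samples, 2) if n_samples else 0
    high_mask = low_mask << 1
    return popcount(word & low_mask) + 2 * popcount(word & high_mask)


genotypes = [0, 1, 2, 2, 1, 0, 0, 1]
packed = pack_genotypes(genotypes)
# The bit-parallel count agrees with the plain per-sample sum.
assert alt_allele_count(packed, len(genotypes)) == sum(genotypes)
```

In a compiled language the same pattern would typically operate on fixed-width machine words with a hardware popcount instruction, which is where the practical gain comes from; the Python version only demonstrates the logic.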