
Scalable probabilistic PCA for large-scale genetic variation data

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal compon...


Bibliographic Details
Main authors: Agrawal, Aman, Chiu, Alec M., Le, Minh, Halperin, Eran, Sankararaman, Sriram
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7286535/
https://www.ncbi.nlm.nih.gov/pubmed/32469896
http://dx.doi.org/10.1371/journal.pgen.1008773
_version_ 1783544898592440320
author Agrawal, Aman
Chiu, Alec M.
Le, Minh
Halperin, Eran
Sankararaman, Sriram
author_facet Agrawal, Aman
Chiu, Alec M.
Le, Minh
Halperin, Eran
Sankararaman, Sriram
author_sort Agrawal, Aman
collection PubMed
description Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
format Online
Article
Text
id pubmed-7286535
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-72865352020-06-15 Scalable probabilistic PCA for large-scale genetic variation data Agrawal, Aman Chiu, Alec M. Le, Minh Halperin, Eran Sankararaman, Sriram PLoS Genet Research Article Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4. Public Library of Science 2020-05-29 /pmc/articles/PMC7286535/ /pubmed/32469896 http://dx.doi.org/10.1371/journal.pgen.1008773 Text en © 2020 Agrawal et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Agrawal, Aman
Chiu, Alec M.
Le, Minh
Halperin, Eran
Sankararaman, Sriram
Scalable probabilistic PCA for large-scale genetic variation data
title Scalable probabilistic PCA for large-scale genetic variation data
title_full Scalable probabilistic PCA for large-scale genetic variation data
title_fullStr Scalable probabilistic PCA for large-scale genetic variation data
title_full_unstemmed Scalable probabilistic PCA for large-scale genetic variation data
title_short Scalable probabilistic PCA for large-scale genetic variation data
title_sort scalable probabilistic pca for large-scale genetic variation data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7286535/
https://www.ncbi.nlm.nih.gov/pubmed/32469896
http://dx.doi.org/10.1371/journal.pgen.1008773