
Efficient toolkit implementing best practices for principal component analysis of population genetic data

MOTIVATION: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage di...


Bibliographic Details
Main authors:	Privé, Florian; Luu, Keurcien; Blum, Michael G B; McGrath, John J; Vilhjálmsson, Bjarni J
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Original Papers
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7750941/ https://www.ncbi.nlm.nih.gov/pubmed/32415959 http://dx.doi.org/10.1093/bioinformatics/btaa520

author	Privé, Florian; Luu, Keurcien; Blum, Michael G B; McGrath, John J; Vilhjálmsson, Bjarni J
collection	PubMed
description	MOTIVATION: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. RESULTS: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data. AVAILABILITY AND IMPLEMENTATION: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-7750941
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7750941 2020-12-28 Efficient toolkit implementing best practices for principal component analysis of population genetic data Privé, Florian Luu, Keurcien Blum, Michael G B McGrath, John J Vilhjálmsson, Bjarni J Bioinformatics Original Papers MOTIVATION: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr. RESULTS: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data. AVAILABILITY AND IMPLEMENTATION: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2020-05-16 /pmc/articles/PMC7750941/ /pubmed/32415959 http://dx.doi.org/10.1093/bioinformatics/btaa520 Text en © The Author(s) 2020. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Efficient toolkit implementing best practices for principal component analysis of population genetic data
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7750941/ https://www.ncbi.nlm.nih.gov/pubmed/32415959 http://dx.doi.org/10.1093/bioinformatics/btaa520
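
The shrinkage bias listed in the description (pitfall ii) can be illustrated with a small simulation. The sketch below is not taken from the paper and does not use bigsnpr or bigutilsr (which are R packages); it is a pure-Python toy with Gaussian features instead of genotypes, and every name and parameter in it (`simulate`, `fit_top_pc`, `group_separation`, `shrinkage_ratio`, the sample and feature counts, the effect size) is a hypothetical choice made for illustration. It fits the first PC on a reference sample containing two groups, projects a fresh sample from the same two groups onto it, and compares how far apart the groups sit on PC1 in each case. When there are many more features than samples, the projected groups tend to sit closer to zero than the reference groups, which is the kind of bias the abstract refers to. The sketch only exhibits the bias; it does not implement the authors' correction.

```python
import math
import random


def simulate(n, p, effect, rng):
    """Simulate n samples x p features; even rows are group +1, odd rows group -1."""
    return [
        [(1.0 if i % 2 == 0 else -1.0) * effect + rng.gauss(0.0, 1.0) for _ in range(p)]
        for i in range(n)
    ]


def fit_top_pc(train, rng, n_iter=200):
    """Return (column means, unit-norm loading vector) for the first PC of `train`."""
    n, p = len(train), len(train[0])
    means = [sum(row[j] for row in train) / n for j in range(p)]
    centred = [[x - m for x, m in zip(row, means)] for row in train]
    # Power iteration on the small n x n Gram matrix rather than the p x p covariance.
    gram = [[sum(a * b for a, b in zip(ri, rk)) for rk in centred] for ri in centred]
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(row, u)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    # Map the sample-space eigenvector back to feature space and normalise it.
    v = [sum(centred[i][j] * u[i] for i in range(n)) for j in range(p)]
    v_norm = math.sqrt(sum(x * x for x in v))
    return means, [x / v_norm for x in v]


def group_separation(data, means, loading):
    """Difference between the mean PC1 scores of the even-row and odd-row groups."""
    scores = [sum((x - m) * w for x, m, w in zip(row, means, loading)) for row in data]
    even, odd = scores[0::2], scores[1::2]
    return sum(even) / len(even) - sum(odd) / len(odd)


def shrinkage_ratio(seed=0, n=40, p=300, effect=0.25):
    """Ratio of projected to reference group separation on PC1 (below 1 means shrinkage)."""
    rng = random.Random(seed)
    train = simulate(n, p, effect, rng)  # reference sample used to fit the PC
    new = simulate(n, p, effect, rng)    # independent sample, projected only
    means, loading = fit_top_pc(train, rng)
    return group_separation(new, means, loading) / group_separation(train, means, loading)


print(f"projected / reference separation on PC1: {shrinkage_ratio():.2f}")
```

A ratio below 1 indicates that the projected individuals are pulled toward the origin relative to the reference individuals. Increasing `p` relative to `n` in this toy setting should make the effect more pronounced.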
