On rare variants in principal component analysis of population stratification
BACKGROUND: Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. The principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, it is still unclear about the a...
Main Authors: | Ma, Shengqing, Shi, Gang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077175/ https://www.ncbi.nlm.nih.gov/pubmed/32183706 http://dx.doi.org/10.1186/s12863-020-0833-x |
_version_ | 1783507372882264064 |
---|---|
author | Ma, Shengqing Shi, Gang |
author_facet | Ma, Shengqing Shi, Gang |
author_sort | Ma, Shengqing |
collection | PubMed |
description | BACKGROUND: Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. The principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, it remains unclear how the analysis performs when rare variants are used. RESULTS: We derive a mathematical expectation of the genetic relationship matrix. Variance and covariance elements of the expected matrix depend explicitly on the allele frequencies of the genetic markers used in the PCA. We show that inter-population variance is solely contained in K principal components (PCs) and mostly in the largest K-1 PCs, where K is the number of populations in the samples. We propose F_PC, the ratio of the inter-population variance to the intra-population variance in the K population-informative PCs, and d^2, the sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio F_PC decreases, the population distance d^2 decreases, and the portion of variance explained by the K PCs diminishes. The results are validated in the analysis of the 1000 Genomes Project data. The ratio F_PC is 93.85, the population distance d^2 is 444.38, and the variance explained by the largest five PCs is 17.09% when using common variants with allele frequencies between 0.4 and 0.5. However, the ratio, distance and percentage decrease to 1.83, 17.83 and 0.74%, respectively, with rare variants of frequencies between 0.0001 and 0.01. CONCLUSIONS: The PCA of population stratification performs worse with rare variants than with common ones. It is necessary to restrict the selection to only the common variants when analyzing population stratification with sequencing data. |
format | Online Article Text |
id | pubmed-7077175 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7077175 2020-03-19 On rare variants in principal component analysis of population stratification Ma, Shengqing Shi, Gang BMC Genet Research Article BioMed Central 2020-03-17 /pmc/articles/PMC7077175/ /pubmed/32183706 http://dx.doi.org/10.1186/s12863-020-0833-x Text en © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Ma, Shengqing Shi, Gang On rare variants in principal component analysis of population stratification |
title | On rare variants in principal component analysis of population stratification |
title_sort | on rare variants in principal component analysis of population stratification |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077175/ https://www.ncbi.nlm.nih.gov/pubmed/32183706 http://dx.doi.org/10.1186/s12863-020-0833-x |
work_keys_str_mv | AT mashengqing onrarevariantsinprincipalcomponentanalysisofpopulationstratification AT shigang onrarevariantsinprincipalcomponentanalysisofpopulationstratification |
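
The abstract's central claim, that PCA separates populations less well when only rare variants are used, can be illustrated with a small simulation. The sketch below is not the authors' derivation or code. It assumes a toy model in which two populations differ by a fixed relative amount (`rel_diff`, a hypothetical parameter) around a base allele frequency, builds a standardized genetic relationship matrix, and reports two quantities for the top PC: a between-to-within population variance ratio (a simplified stand-in for F_PC, not the paper's exact definition) and the share of total variance explained. The function name, sample sizes, and marker counts are all illustrative assumptions.

```python
import numpy as np


def simulate_and_score(freq_low, freq_high, n_per_pop=100, n_markers=2000,
                       rel_diff=0.2, seed=0):
    """Toy two-population PCA.

    Returns (between/within variance ratio on PC1, share of variance in PC1).
    """
    rng = np.random.default_rng(seed)

    # Assumed toy model (not from the paper): the two populations differ by a
    # fixed *relative* amount around a base frequency, random sign per marker.
    base = rng.uniform(freq_low, freq_high, size=n_markers)
    sign = rng.choice([-1.0, 1.0], size=n_markers)
    p1 = base * (1.0 + rel_diff * sign)
    p2 = base * (1.0 - rel_diff * sign)

    # Genotypes are allele counts in {0, 1, 2}, drawn independently per marker.
    geno = np.vstack([
        rng.binomial(2, p1, size=(n_per_pop, n_markers)),
        rng.binomial(2, p2, size=(n_per_pop, n_markers)),
    ]).astype(float)
    labels = np.repeat([0, 1], n_per_pop)

    # Drop markers that are monomorphic in the sample (cannot be standardized).
    p_hat = geno.mean(axis=0) / 2.0
    keep = (p_hat > 0.0) & (p_hat < 1.0)
    geno, p_hat = geno[:, keep], p_hat[keep]

    # Standardize each marker by its sample allele frequency, then form the
    # individual-by-individual genetic relationship matrix.
    z = (geno - 2.0 * p_hat) / np.sqrt(2.0 * p_hat * (1.0 - p_hat))
    grm = z @ z.T / z.shape[1]

    # eigh returns eigenvalues in ascending order; the last one is the largest.
    eigvals, eigvecs = np.linalg.eigh(grm)
    pc1 = eigvecs[:, -1]
    explained = eigvals[-1] / eigvals.sum()

    # Between- and within-population variance of the PC1 scores.
    grand = pc1.mean()
    between = sum((labels == k).mean() * (pc1[labels == k].mean() - grand) ** 2
                  for k in (0, 1))
    within = sum(((pc1[labels == k] - pc1[labels == k].mean()) ** 2).sum()
                 for k in (0, 1)) / pc1.size
    return between / within, explained


if __name__ == "__main__":
    # Frequency bands taken from the abstract; everything else is assumed.
    for name, (lo, hi) in {"common": (0.4, 0.5), "rare": (0.0001, 0.01)}.items():
        ratio, explained = simulate_and_score(lo, hi)
        print(f"{name}: ratio={ratio:.3f}, variance explained by PC1={explained:.2%}")
```

Under these assumptions the common-variant band should yield a clearly larger ratio and a larger share of variance in the top PC than the rare-variant band. The absolute numbers depend on the toy model and are not comparable to the 1000 Genomes Project values quoted in the abstract.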