
PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations

Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on gen...


Bibliographic Details
Main Authors: Paschou, Peristera; Ziv, Elad; Burchard, Esteban G; Choudhry, Shweta; Rodriguez-Cintron, William; Mahoney, Michael W; Drineas, Petros
Format: Text
Language: English
Published: Public Library of Science, 2007
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1988848/
https://www.ncbi.nlm.nih.gov/pubmed/17892327
http://dx.doi.org/10.1371/journal.pgen.0030160
_version_ 1782135430717636608
author Paschou, Peristera
Ziv, Elad
Burchard, Esteban G
Choudhry, Shweta
Rodriguez-Cintron, William
Mahoney, Michael W
Drineas, Petros
author_facet Paschou, Peristera
Ziv, Elad
Burchard, Esteban G
Choudhry, Shweta
Rodriguez-Cintron, William
Mahoney, Michael W
Drineas, Petros
author_sort Paschou, Peristera
collection PubMed
description Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using less than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations.
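The description above says that small SNP subsets are chosen so as to reproduce the structure PCA finds on the full dataset, without ancestry labels, but this record does not spell out the scoring rule. The following is only a minimal sketch of one plausible reading: each SNP is scored by the sum of its squared loadings on the top k principal components, and the r highest-scoring SNPs are kept. The function name `pca_correlated_snps`, the parameters `k` and `r`, and the synthetic two-population genotype matrix are all assumptions made for illustration, not the authors' code or data.

```python
import numpy as np


def pca_correlated_snps(X, k, r):
    """Pick r columns (SNPs) of X that load most heavily on the top k PCs.

    X is an (individuals x SNPs) genotype matrix coded 0/1/2.
    The scoring rule is an assumption for illustration: the score of SNP j is
    the sum of squared entries of column j in the top k right singular vectors.
    """
    Xc = X - X.mean(axis=0)                      # center each SNP
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = np.sum(Vt[:k] ** 2, axis=0)         # one score per SNP
    selected = np.argsort(scores)[::-1][:r]      # indices of the r largest scores
    return selected, scores


# Synthetic demo (hypothetical data): two populations, 200 SNPs,
# only the first 10 SNPs differ in allele frequency between populations.
rng = np.random.default_rng(0)
n_per_pop, n_snps, n_informative = 50, 200, 10
freq_a = np.full(n_snps, 0.5)
freq_b = freq_a.copy()
freq_a[:n_informative] = 0.05
freq_b[:n_informative] = 0.95
X = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
]).astype(float)

selected, scores = pca_correlated_snps(X, k=1, r=n_informative)
print(sorted(selected.tolist()))
```

In this toy setting no population labels are passed to the selection step; the chosen SNPs could then be handed to a simple clustering routine, in the spirit of the description. The choice of `k` and `r` is left to the user here and is not taken from the record.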
format Text
id pubmed-1988848
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-1988848 2007-09-27 PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations Paschou, Peristera Ziv, Elad Burchard, Esteban G Choudhry, Shweta Rodriguez-Cintron, William Mahoney, Michael W Drineas, Petros PLoS Genet Research Article Existing methods to ascertain small sets of markers for the identification of human population structure require prior knowledge of individual ancestry. Based on Principal Components Analysis (PCA), and recent results in theoretical computer science, we present a novel algorithm that, applied on genomewide data, selects small subsets of SNPs (PCA-correlated SNPs) to reproduce the structure found by PCA on the complete dataset, without use of ancestry information. Evaluating our method on a previously described dataset (10,805 SNPs, 11 populations), we demonstrate that a very small set of PCA-correlated SNPs can be effectively employed to assign individuals to particular continents or populations, using a simple clustering algorithm. We validate our methods on the HapMap populations and achieve perfect intercontinental differentiation with 14 PCA-correlated SNPs. The Chinese and Japanese populations can be easily differentiated using less than 100 PCA-correlated SNPs ascertained after evaluating 1.7 million SNPs from HapMap. We show that, in general, structure informative SNPs are not portable across geographic regions. However, we manage to identify a general set of 50 PCA-correlated SNPs that effectively assigns individuals to one of nine different populations. Compared to analysis with the measure of informativeness, our methods, although unsupervised, achieved similar results. We proceed to demonstrate that our algorithm can be effectively used for the analysis of admixed populations without having to trace the origin of individuals. Analyzing a Puerto Rican dataset (192 individuals, 7,257 SNPs), we show that PCA-correlated SNPs can be used to successfully predict structure and ancestry proportions. We subsequently validate these SNPs for structure identification in an independent Puerto Rican dataset. The algorithm that we introduce runs in seconds and can be easily applied on large genome-wide datasets, facilitating the identification of population substructure, stratification assessment in multi-stage whole-genome association studies, and the study of demographic history in human populations. Public Library of Science 2007-09 2007-09-21 /pmc/articles/PMC1988848/ /pubmed/17892327 http://dx.doi.org/10.1371/journal.pgen.0030160 Text en Copyright: © 2007 Paschou et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Paschou, Peristera
Ziv, Elad
Burchard, Esteban G
Choudhry, Shweta
Rodriguez-Cintron, William
Mahoney, Michael W
Drineas, Petros
PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations
title PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations
title_full PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations
title_fullStr PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations
title_full_unstemmed PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations
title_short PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations
title_sort pca-correlated snps for structure identification in worldwide human populations
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1988848/
https://www.ncbi.nlm.nih.gov/pubmed/17892327
http://dx.doi.org/10.1371/journal.pgen.0030160