
FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data

BACKGROUND: Identifying subpopulations within a study and inferring intercontinental ancestry of the samples are important steps in genome-wide association studies. Two software packages are widely used in the analysis of substructure: Structure and Eigenstrat. Structure assigns each individual to a pop...


Bibliographic Details
Main Authors: Li, Yafang, Byun, Jinyoung, Cai, Guoshuai, Xiao, Xiangjun, Han, Younghun, Cornelis, Olivier, Dinulos, James E., Dennis, Joe, Easton, Douglas, Gorlov, Ivan, Seldin, Michael F., Amos, Christopher I.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4784403/
https://www.ncbi.nlm.nih.gov/pubmed/26961892
http://dx.doi.org/10.1186/s12859-016-0965-1
author Li, Yafang
Byun, Jinyoung
Cai, Guoshuai
Xiao, Xiangjun
Han, Younghun
Cornelis, Olivier
Dinulos, James E.
Dennis, Joe
Easton, Douglas
Gorlov, Ivan
Seldin, Michael F.
Amos, Christopher I.
author_facet Li, Yafang
Byun, Jinyoung
Cai, Guoshuai
Xiao, Xiangjun
Han, Younghun
Cornelis, Olivier
Dinulos, James E.
Dennis, Joe
Easton, Douglas
Gorlov, Ivan
Seldin, Michael F.
Amos, Christopher I.
author_sort Li, Yafang
collection PubMed
description BACKGROUND: Identifying subpopulations within a study and inferring intercontinental ancestry of the samples are important steps in genome-wide association studies. Two software packages are widely used in the analysis of substructure: Structure and Eigenstrat. Structure assigns each individual to a population by using a Bayesian method with multiple tuning parameters. It requires considerable computational time when dealing with thousands of samples and lacks the ability to create scores that could be used as covariates. Eigenstrat uses a principal component analysis method to model all sources of sampling variation. However, it does not readily provide information directly relevant to ancestral origin; the eigenvectors generated by Eigenstrat are sample-specific and thus cannot be generalized to other individuals. RESULTS: We developed FastPop, an efficient R package that fills the gap between Structure and Eigenstrat. It can (1) generate PCA scores that identify ancestral origins and can be used for multiple studies, and (2) infer ancestry information for data arising from two or more intercontinental origins. We demonstrate the use of FastPop using 2318 SNP markers selected from the genome based on high variability among European, Asian and West African (African) populations. We conducted an analysis of 505 HapMap samples with European, African or Asian ancestry along with 19661 additional samples of unknown ancestry. The results from FastPop are highly consistent with those obtained by Structure across the 19661 samples we studied. The correlations of the results between FastPop and Structure are 0.99, 0.97 and 0.99 for European, African and Asian ancestry scores, respectively. FastPop is also more efficient than Structure: it finished ancestry inference for 19661 samples in 16 min, compared with the 21–24 h required by Structure. FastPop also provided scores based on SNP weights, so the scores of the reference population can be applied to other studies, provided the same set of markers is used. We also present an application of the method for studying four continental populations (European, Asian, African, and Native American). CONCLUSIONS: We developed an algorithm that can infer ancestries on data involving two or more intercontinental origins. It is efficient for analyzing large datasets. Additionally, the PCA-derived scores can be applied to multiple data sets to ensure the same ancestry analysis is applied to all studies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0965-1) contains supplementary material, which is available to authorized users.
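The idea of reusable, SNP-weight-based PCA scores described in the abstract can be illustrated with a minimal Python sketch. This is not FastPop's actual R interface or algorithm: the function names `fit_snp_weights` and `project`, the simulated genotypes, and the nearest-centroid assignment are all assumptions made for illustration. The sketch fits principal components on a labeled reference panel only, keeps the per-SNP weights, and then scores new samples with those fixed weights, which is why the same marker set is required.

```python
import numpy as np

def fit_snp_weights(ref_genotypes, n_components=2):
    """Derive per-SNP weights from a reference panel via PCA (hypothetical helper)."""
    ref = np.asarray(ref_genotypes, dtype=float)
    mean = ref.mean(axis=0)
    scale = ref.std(axis=0)
    scale[scale == 0] = 1.0  # avoid division by zero for monomorphic SNPs
    z = (ref - mean) / scale
    # Rows of vt are the principal axes in SNP space.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    weights = vt[:n_components].T  # shape: (n_snps, n_components)
    return mean, scale, weights

def project(genotypes, mean, scale, weights):
    """Score samples with weights fitted on the reference panel (hypothetical helper)."""
    z = (np.asarray(genotypes, dtype=float) - mean) / scale
    return z @ weights

# Simulated data (an assumption, not the study's markers): three reference
# populations with distinct allele frequencies, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
n_snps = 200
freqs = rng.uniform(0.05, 0.95, size=(3, n_snps))
ref = np.vstack([rng.binomial(2, f, size=(50, n_snps)) for f in freqs])
labels = np.repeat(np.arange(3), 50)

mean, scale, weights = fit_snp_weights(ref, n_components=2)
ref_scores = project(ref, mean, scale, weights)
centroids = np.array([ref_scores[labels == k].mean(axis=0) for k in range(3)])

# New samples are scored with the stored weights; the reference PCA is not refit.
new = rng.binomial(2, freqs[1], size=(10, n_snps))
new_scores = project(new, mean, scale, weights)

# Illustrative assignment: nearest reference centroid in PC space.
dists = np.linalg.norm(new_scores[:, None, :] - centroids[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print(new_scores.shape, nearest)
```

Because the centering, scaling, and weights come only from the reference panel, scores computed this way are comparable across studies, unlike eigenvectors recomputed separately on each sample set.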
format Online
Article
Text
id pubmed-4784403
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4784403 2016-03-10 FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data Li, Yafang Byun, Jinyoung Cai, Guoshuai Xiao, Xiangjun Han, Younghun Cornelis, Olivier Dinulos, James E. Dennis, Joe Easton, Douglas Gorlov, Ivan Seldin, Michael F. Amos, Christopher I. BMC Bioinformatics Software BACKGROUND: Identifying subpopulations within a study and inferring intercontinental ancestry of the samples are important steps in genome-wide association studies. Two software packages are widely used in the analysis of substructure: Structure and Eigenstrat. Structure assigns each individual to a population by using a Bayesian method with multiple tuning parameters. It requires considerable computational time when dealing with thousands of samples and lacks the ability to create scores that could be used as covariates. Eigenstrat uses a principal component analysis method to model all sources of sampling variation. However, it does not readily provide information directly relevant to ancestral origin; the eigenvectors generated by Eigenstrat are sample-specific and thus cannot be generalized to other individuals. RESULTS: We developed FastPop, an efficient R package that fills the gap between Structure and Eigenstrat. It can (1) generate PCA scores that identify ancestral origins and can be used for multiple studies, and (2) infer ancestry information for data arising from two or more intercontinental origins. We demonstrate the use of FastPop using 2318 SNP markers selected from the genome based on high variability among European, Asian and West African (African) populations. We conducted an analysis of 505 HapMap samples with European, African or Asian ancestry along with 19661 additional samples of unknown ancestry. The results from FastPop are highly consistent with those obtained by Structure across the 19661 samples we studied. The correlations of the results between FastPop and Structure are 0.99, 0.97 and 0.99 for European, African and Asian ancestry scores, respectively. FastPop is also more efficient than Structure: it finished ancestry inference for 19661 samples in 16 min, compared with the 21–24 h required by Structure. FastPop also provided scores based on SNP weights, so the scores of the reference population can be applied to other studies, provided the same set of markers is used. We also present an application of the method for studying four continental populations (European, Asian, African, and Native American). CONCLUSIONS: We developed an algorithm that can infer ancestries on data involving two or more intercontinental origins. It is efficient for analyzing large datasets. Additionally, the PCA-derived scores can be applied to multiple data sets to ensure the same ancestry analysis is applied to all studies. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0965-1) contains supplementary material, which is available to authorized users. BioMed Central 2016-03-09 /pmc/articles/PMC4784403/ /pubmed/26961892 http://dx.doi.org/10.1186/s12859-016-0965-1 Text en © Li et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Software
Li, Yafang
Byun, Jinyoung
Cai, Guoshuai
Xiao, Xiangjun
Han, Younghun
Cornelis, Olivier
Dinulos, James E.
Dennis, Joe
Easton, Douglas
Gorlov, Ivan
Seldin, Michael F.
Amos, Christopher I.
FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data
title FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data
title_full FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data
title_fullStr FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data
title_full_unstemmed FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data
title_short FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data
title_sort fastpop: a rapid principal component derived method to infer intercontinental ancestry using genetic data
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4784403/
https://www.ncbi.nlm.nih.gov/pubmed/26961892
http://dx.doi.org/10.1186/s12859-016-0965-1