A fast least-squares algorithm for population inference

Bibliographic Details
Main authors: Parry, R Mitchell; Wang, May D
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3602075/
https://www.ncbi.nlm.nih.gov/pubmed/23343408
http://dx.doi.org/10.1186/1471-2105-14-28
author Parry, R Mitchell
Wang, May D
collection PubMed
description BACKGROUND: Population inference is an important problem in genetics, used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters P and Q of this binomial likelihood model can be inferred using slow sampling methods such as Markov chain Monte Carlo or faster gradient-based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model, motivated by a Euclidean interpretation of the genotype feature space. The result is a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.
RESULTS: We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, a small number of samples, or a greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5 to 6 times faster.
CONCLUSIONS: The computational advantage of the least-squares approach, along with its good estimation performance, warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.
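The model summarized in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a generic alternating least-squares heuristic written under several assumptions that the record does not state. Q is taken to be an n x k matrix of membership proportions, P a k x m matrix of allele frequencies, each genotype a Binomial(2, (QP)_ij) draw, and constraints are enforced by simple clipping and renormalization. The function names `simulate_genotypes` and `fit_least_squares` are hypothetical.

```python
import numpy as np


def simulate_genotypes(n, m, k, rng):
    """Draw a toy dataset from the binomial model (illustrative only)."""
    Q = rng.dirichlet(np.ones(k), size=n)      # n x k memberships, rows sum to 1
    P = rng.uniform(0.05, 0.95, size=(k, m))   # k x m allele frequencies
    G = rng.binomial(2, Q @ P)                 # n x m genotypes in {0, 1, 2}
    return G, Q, P


def fit_least_squares(G, k, n_iter=50, seed=0):
    """Alternating least squares on G/2 ~ Q P.

    Assumption: constraints are handled by clipping and renormalizing,
    a heuristic stand-in rather than the method of the paper.
    """
    rng = np.random.default_rng(seed)
    n, _ = G.shape
    X = G / 2.0                                # scale genotypes to [0, 1]
    Q = rng.dirichlet(np.ones(k), size=n)      # random feasible start
    for _ in range(n_iter):
        # Fix Q, solve min ||Q P - X||_F for P, then keep P inside (0, 1).
        P = np.linalg.lstsq(Q, X, rcond=None)[0]
        P = np.clip(P, 1e-6, 1 - 1e-6)
        # Fix P, solve min ||P^T Q^T - X^T||_F for Q, then push each row
        # back onto the simplex (non-negative, sums to 1).
        Q = np.linalg.lstsq(P.T, X.T, rcond=None)[0].T
        Q = np.clip(Q, 1e-9, None)
        Q /= Q.sum(axis=1, keepdims=True)
    return Q, P


rng = np.random.default_rng(0)
G, Q_true, P_true = simulate_genotypes(100, 300, 3, rng)
Q_hat, P_hat = fit_least_squares(G, k=3)
# Compare fitted and true expected genotype frequencies; Q itself is only
# identifiable up to a relabeling of the k populations.
rmse = np.sqrt(np.mean((Q_hat @ P_hat - Q_true @ P_true) ** 2))
print(f"RMSE of Q P against the simulated truth: {rmse:.3f}")
```

The comparison is made on the product QP rather than on Q directly because population labels can be permuted without changing the fit.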
format Online
Article
Text
id pubmed-3602075
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3602075 2013-03-25
A fast least-squares algorithm for population inference
Parry, R Mitchell; Wang, May D
BMC Bioinformatics
Research Article
BioMed Central 2013-01-23
/pmc/articles/PMC3602075/ /pubmed/23343408 http://dx.doi.org/10.1186/1471-2105-14-28
Text en
Copyright ©2013 Parry and Wang; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Research Article