Scaling the Discrete-time Wright Fisher model to biobank-scale datasets

The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection...

Bibliographic Details
Main Authors:	Spence, Jeffrey P., Zeng, Tony, Mostafavi, Hakhamanesh, Pritchard, Jonathan K.
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245735/ https://www.ncbi.nlm.nih.gov/pubmed/37293115 http://dx.doi.org/10.1101/2023.05.19.541517

author	Spence, Jeffrey P. Zeng, Tony Mostafavi, Hakhamanesh Pritchard, Jonathan K.
collection	PubMed
description	The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
format	Online Article Text
id	pubmed-10245735
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Cold Spring Harbor Laboratory
record_format	MEDLINE/PubMed
spelling	pubmed-10245735 2023-06-08 bioRxiv Article Cold Spring Harbor Laboratory 2023-05-22 /pmc/articles/PMC10245735/ /pubmed/37293115 http://dx.doi.org/10.1101/2023.05.19.541517 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title	Scaling the Discrete-time Wright Fisher model to biobank-scale datasets
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245735/ https://www.ncbi.nlm.nih.gov/pubmed/37293115 http://dx.doi.org/10.1101/2023.05.19.541517

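The two Binomial observations named in the abstract can be illustrated with a minimal sketch in Python (standard library only). This is not the authors' algorithm or code. The function names, the haploid selection and mutation parameterization, the population size of 2000, and the 1e-12 tolerance are all assumptions chosen for illustration. The sketch builds one row of a DTWF transition matrix, counts how many entries are non-negligible, and measures the total variation distance between two adjacent rows.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, via log-gamma."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def dtwf_row(i, n, s=0.0, mu=0.0):
    """Next-generation allele-count distribution given i copies among n now.

    Assumed (illustrative) model: the allele has relative fitness 1 - s,
    followed by symmetric mutation at rate mu, then Binomial resampling.
    """
    x = i / n
    p = x * (1.0 - s) / (1.0 - s * x)    # expected frequency after selection
    p = p * (1.0 - mu) + (1.0 - p) * mu  # expected frequency after mutation
    return [math.exp(binom_logpmf(k, n, p)) for k in range(n + 1)]


def effective_support(row, tol=1e-12):
    """Number of entries whose probability exceeds tol."""
    return sum(1 for q in row if q > tol)


def total_variation(a, b):
    """Total variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(u - v) for u, v in zip(a, b))


n = 2000                        # hypothetical population size
row_a = dtwf_row(200, n)        # 200 copies of the allele
row_b = dtwf_row(201, n)        # a neighbouring state
print("entries above tolerance:", effective_support(row_a), "of", n + 1)
print("TV distance between adjacent rows:", total_variation(row_a, row_b))
```

In this toy setting only a small fraction of each row carries non-negligible mass, and neighbouring rows differ only slightly. These are the two properties the abstract credits for reducing matrix-vector multiplication from quadratic to linear time.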