Scaling the Discrete-time Wright Fisher model to biobank-scale datasets

The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection...

Bibliographic Details
Main authors: Spence, Jeffrey P., Zeng, Tony, Mostafavi, Hakhamanesh, Pritchard, Jonathan K.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245735/
https://www.ncbi.nlm.nih.gov/pubmed/37293115
http://dx.doi.org/10.1101/2023.05.19.541517
collection PubMed
description The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects.
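As a rough illustration of the first observation in the description above (Binomial distributions are approximately sparse), the sketch below applies one generation of a neutral DTWF model to a vector of allele-count probabilities twice: once using every entry of each Binomial row, and once evaluating each row only in a window around its mean. This is not the authors' algorithm. It omits mutation, selection, and the low-rank step, so it is not linear time. The function names, the window rule (`c` standard deviations plus `c`), and the use of `n` for the number of haploid copies are assumptions made for this example.

```python
import numpy as np

def log_factorials(n):
    """Return array lf with lf[k] = log(k!) for k = 0..n."""
    return np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, n + 1)))))

def binom_pmf(n, p, ks, lf):
    """Binomial(n, p) pmf at the integer array ks, computed in log space."""
    if p <= 0.0:
        return (ks == 0).astype(float)
    if p >= 1.0:
        return (ks == n).astype(float)
    log_pmf = (lf[n] - lf[ks] - lf[n - ks]
               + ks * np.log(p) + (n - ks) * np.log1p(-p))
    return np.exp(log_pmf)

def step_dense(v, n):
    """One neutral generation: out[j] = sum_i v[i] * P(Binomial(n, i/n) = j)."""
    lf = log_factorials(n)
    ks = np.arange(n + 1)
    out = np.zeros(n + 1)
    for i in range(n + 1):
        out += v[i] * binom_pmf(n, i / n, ks, lf)
    return out

def step_truncated(v, n, c=8.0):
    """Same update, but each row is evaluated only near its mean i.

    The half-width c * sd + c is an illustrative choice, not a bound
    taken from the paper. Also returns the number of pmf evaluations.
    """
    lf = log_factorials(n)
    out = np.zeros(n + 1)
    evaluated = 0
    for i in range(n + 1):
        p = i / n
        half_width = int(np.ceil(c * np.sqrt(n * p * (1 - p)) + c))
        lo, hi = max(0, i - half_width), min(n, i + half_width)
        ks = np.arange(lo, hi + 1)
        out[lo:hi + 1] += v[i] * binom_pmf(n, p, ks, lf)
        evaluated += ks.size
    return out, evaluated

if __name__ == "__main__":
    n = 1000
    v = np.full(n + 1, 1.0 / (n + 1))  # uniform starting distribution
    dense = step_dense(v, n)
    approx, evaluated = step_truncated(v, n)
    print("L1 difference:", np.abs(dense - approx).sum())
    print("pmf evaluations:", evaluated, "of", (n + 1) ** 2)
```

Because each window grows roughly like the square root of `n`, the truncated update touches far fewer entries than the full `(n + 1)` by `(n + 1)` matrix as `n` grows. The description's second observation, that rows with similar success probabilities are nearly identical, is what the authors use to reduce the cost further.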
id pubmed-10245735
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-10245735 2023-06-08 bioRxiv Article Cold Spring Harbor Laboratory 2023-05-22 /pmc/articles/PMC10245735/ /pubmed/37293115 http://dx.doi.org/10.1101/2023.05.19.541517 Text en
This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
topic Article