Scaling the Discrete-time Wright Fisher model to biobank-scale datasets
The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection...
Main Authors: | Spence, Jeffrey P.; Zeng, Tony; Mostafavi, Hakhamanesh; Pritchard, Jonathan K. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245735/ https://www.ncbi.nlm.nih.gov/pubmed/37293115 http://dx.doi.org/10.1101/2023.05.19.541517 |
_version_ | 1785054918160154624 |
---|---|
author | Spence, Jeffrey P.; Zeng, Tony; Mostafavi, Hakhamanesh; Pritchard, Jonathan K. |
author_facet | Spence, Jeffrey P.; Zeng, Tony; Mostafavi, Hakhamanesh; Pritchard, Jonathan K. |
author_sort | Spence, Jeffrey P. |
collection | PubMed |
description | The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects. |
format | Online Article Text |
id | pubmed-10245735 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-102457352023-06-08 Scaling the Discrete-time Wright Fisher model to biobank-scale datasets Spence, Jeffrey P. Zeng, Tony Mostafavi, Hakhamanesh Pritchard, Jonathan K. bioRxiv Article The Discrete-Time Wright Fisher (DTWF) model and its large population diffusion limit are central to population genetics. These models describe the forward-in-time evolution of the frequency of an allele in a population and can include the fundamental forces of genetic drift, mutation, and selection. Computing likelihoods under the diffusion process is feasible, but the diffusion approximation breaks down for large sample sizes or in the presence of strong selection. Unfortunately, existing methods for computing likelihoods under the DTWF model do not scale to current exome sequencing sample sizes in the hundreds of thousands. Here we present an algorithm that approximates the DTWF model with provably bounded error and runs in time linear in the size of the population. Our approach relies on two key observations about Binomial distributions. The first is that Binomial distributions are approximately sparse. The second is that Binomial distributions with similar success probabilities are extremely close as distributions, allowing us to approximate the DTWF Markov transition matrix as a very low rank matrix. Together, these observations enable matrix-vector multiplication in linear (as opposed to the usual quadratic) time. We prove similar properties for Hypergeometric distributions, enabling fast computation of likelihoods for subsamples of the population. We show theoretically and in practice that this approximation is highly accurate and can scale to population sizes in the billions, paving the way for rigorous biobank-scale population genetic inference. Finally, we use our results to estimate how increasing sample sizes will improve the estimation of selection coefficients acting on loss-of-function variants. 
We find that increasing sample sizes beyond existing large exome sequencing cohorts will provide essentially no additional information except for genes with the most extreme fitness effects. Cold Spring Harbor Laboratory 2023-05-22 /pmc/articles/PMC10245735/ /pubmed/37293115 http://dx.doi.org/10.1101/2023.05.19.541517 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
spellingShingle | Article Spence, Jeffrey P. Zeng, Tony Mostafavi, Hakhamanesh Pritchard, Jonathan K. Scaling the Discrete-time Wright Fisher model to biobank-scale datasets |
title | Scaling the Discrete-time Wright Fisher model to biobank-scale datasets |
title_full | Scaling the Discrete-time Wright Fisher model to biobank-scale datasets |
title_fullStr | Scaling the Discrete-time Wright Fisher model to biobank-scale datasets |
title_full_unstemmed | Scaling the Discrete-time Wright Fisher model to biobank-scale datasets |
title_short | Scaling the Discrete-time Wright Fisher model to biobank-scale datasets |
title_sort | scaling the discrete-time wright fisher model to biobank-scale datasets |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245735/ https://www.ncbi.nlm.nih.gov/pubmed/37293115 http://dx.doi.org/10.1101/2023.05.19.541517 |
work_keys_str_mv | AT spencejeffreyp scalingthediscretetimewrightfishermodeltobiobankscaledatasets AT zengtony scalingthediscretetimewrightfishermodeltobiobankscaledatasets AT mostafavihakhamanesh scalingthediscretetimewrightfishermodeltobiobankscaledatasets AT pritchardjonathank scalingthediscretetimewrightfishermodeltobiobankscaledatasets |
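
The description field above notes that Binomial distributions are approximately sparse, which is one of the two observations the abstract credits for fast matrix-vector multiplication. The following is a minimal Python sketch of that single observation, not the authors' algorithm: it compares one dense DTWF transition step with a step that only evaluates each Binomial row inside a window around its mean. The function names, the haploid selection and symmetric mutation parameterization in `next_freq`, and the `width` parameter are all assumptions made for illustration. The low-rank step from the abstract is not attempted here, so this sketch does not reach linear time.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def next_freq(x, s=0.0, mu=0.0):
    """Expected allele frequency in the next generation.

    Hypothetical parameterization: haploid selection with advantage s,
    followed by symmetric mutation at rate mu.
    """
    x_sel = x * (1.0 + s) / (1.0 + s * x)
    return x_sel * (1.0 - mu) + (1.0 - x_sel) * mu


def step_dense(v, N, s=0.0, mu=0.0):
    """One generation: out[j] = sum_i v[i] * Binomial(j; N, p_i), all entries."""
    out = [0.0] * (N + 1)
    for i, mass in enumerate(v):
        if mass == 0.0:
            continue
        p = next_freq(i / N, s, mu)
        for j in range(N + 1):
            out[j] += mass * math.exp(binom_logpmf(j, N, p))
    return out


def step_truncated(v, N, s=0.0, mu=0.0, width=8.0):
    """Same step, but each row is evaluated only near its mean.

    The window half-width is width standard deviations plus an additive
    pad of width counts (an assumed heuristic, not a proven error bound).
    """
    out = [0.0] * (N + 1)
    for i, mass in enumerate(v):
        if mass == 0.0:
            continue
        p = next_freq(i / N, s, mu)
        mean = N * p
        sd = math.sqrt(max(0.0, N * p * (1.0 - p)))
        half = width * sd + width
        lo = max(0, math.floor(mean - half))
        hi = min(N, math.ceil(mean + half))
        for j in range(lo, hi + 1):
            out[j] += mass * math.exp(binom_logpmf(j, N, p))
    return out
```

Calling `step_truncated(v, N)` on a probability vector `v` of length `N + 1` should closely match `step_dense(v, N)` while touching far fewer entries when `N` is large, because each row's window grows roughly with the square root of `N` rather than with `N`.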