Tractable and Expressive Generative Models of Genetic Variation Data
Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gain...
Main authors: | Dang, Meihua; Liu, Anji; Wei, Xinzhu; Sankararaman, Sriram; Van den Broeck, Guy |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245670/ https://www.ncbi.nlm.nih.gov/pubmed/37292742 http://dx.doi.org/10.1101/2023.05.16.541036 |
_version_ | 1785054905810026496 |
---|---|
author | Dang, Meihua Liu, Anji Wei, Xinzhu Sankararaman, Sriram Van den Broeck, Guy |
author_sort | Dang, Meihua |
collection | PubMed |
description | Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gained popularity due to their ability to generate AGs closely resembling empirical data. These models, however, present a tradeoff between expressivity and tractability. Here, we propose to use hidden Chow-Liu trees (HCLTs) and their representation as probabilistic circuits (PCs) as a solution to this tradeoff. We first learn an HCLT structure that captures the long-range dependencies among SNPs in the training data set. We then convert the HCLT to its equivalent PC as a means of supporting tractable and efficient probabilistic inference. The parameters in these PCs are inferred with an expectation-maximization algorithm using the training data. Compared to other models for generating AGs, HCLT obtains the largest log-likelihood on test genomes across SNPs chosen across the genome and from a contiguous genomic region. Moreover, the AGs generated by HCLT more accurately resemble the source data set in their patterns of allele frequencies, linkage disequilibrium, pairwise haplotype distances, and population structure. This work not only presents a new and robust AG simulator but also manifests the potential of PCs in population genetics. |
format | Online Article Text |
id | pubmed-10245670 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-10245670 2023-06-08 Tractable and Expressive Generative Models of Genetic Variation Data Dang, Meihua Liu, Anji Wei, Xinzhu Sankararaman, Sriram Van den Broeck, Guy bioRxiv Article Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gained popularity due to their ability to generate AGs closely resembling empirical data. These models, however, present a tradeoff between expressivity and tractability. Here, we propose to use hidden Chow-Liu trees (HCLTs) and their representation as probabilistic circuits (PCs) as a solution to this tradeoff. We first learn an HCLT structure that captures the long-range dependencies among SNPs in the training data set. We then convert the HCLT to its equivalent PC as a means of supporting tractable and efficient probabilistic inference. The parameters in these PCs are inferred with an expectation-maximization algorithm using the training data. Compared to other models for generating AGs, HCLT obtains the largest log-likelihood on test genomes across SNPs chosen across the genome and from a contiguous genomic region. Moreover, the AGs generated by HCLT more accurately resemble the source data set in their patterns of allele frequencies, linkage disequilibrium, pairwise haplotype distances, and population structure. This work not only presents a new and robust AG simulator but also manifests the potential of PCs in population genetics. Cold Spring Harbor Laboratory 2023-05-18 /pmc/articles/PMC10245670/ /pubmed/37292742 http://dx.doi.org/10.1101/2023.05.16.541036 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article Dang, Meihua Liu, Anji Wei, Xinzhu Sankararaman, Sriram Van den Broeck, Guy Tractable and Expressive Generative Models of Genetic Variation Data |
title | Tractable and Expressive Generative Models of Genetic Variation Data |
title_sort | tractable and expressive generative models of genetic variation data |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245670/ https://www.ncbi.nlm.nih.gov/pubmed/37292742 http://dx.doi.org/10.1101/2023.05.16.541036 |