Tractable and Expressive Generative Models of Genetic Variation Data

Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gain...

Bibliographic Details
Main Authors: Dang, Meihua, Liu, Anji, Wei, Xinzhu, Sankararaman, Sriram, Van den Broeck, Guy
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245670/
https://www.ncbi.nlm.nih.gov/pubmed/37292742
http://dx.doi.org/10.1101/2023.05.16.541036
_version_ 1785054905810026496
author Dang, Meihua
Liu, Anji
Wei, Xinzhu
Sankararaman, Sriram
Van den Broeck, Guy
author_facet Dang, Meihua
Liu, Anji
Wei, Xinzhu
Sankararaman, Sriram
Van den Broeck, Guy
author_sort Dang, Meihua
collection PubMed
description Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gained popularity due to their ability to generate AGs closely resembling empirical data. These models, however, present a tradeoff between expressivity and tractability. Here, we propose to use hidden Chow-Liu trees (HCLTs) and their representation as probabilistic circuits (PCs) as a solution to this tradeoff. We first learn an HCLT structure that captures the long-range dependencies among SNPs in the training data set. We then convert the HCLT to its equivalent PC as a means of supporting tractable and efficient probabilistic inference. The parameters in these PCs are inferred with an expectation-maximization algorithm using the training data. Compared to other models for generating AGs, HCLT obtains the largest log-likelihood on test genomes across SNPs chosen across the genome and from a contiguous genomic region. Moreover, the AGs generated by HCLT more accurately resemble the source data set in their patterns of allele frequencies, linkage disequilibrium, pairwise haplotype distances, and population structure. This work not only presents a new and robust AG simulator but also manifests the potential of PCs in population genetics.
format Online
Article
Text
id pubmed-10245670
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10245670 2023-06-08 Tractable and Expressive Generative Models of Genetic Variation Data Dang, Meihua Liu, Anji Wei, Xinzhu Sankararaman, Sriram Van den Broeck, Guy bioRxiv Article
Cold Spring Harbor Laboratory 2023-05-18 /pmc/articles/PMC10245670/ /pubmed/37292742 http://dx.doi.org/10.1101/2023.05.16.541036 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
title Tractable and Expressive Generative Models of Genetic Variation Data
title_sort tractable and expressive generative models of genetic variation data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10245670/
https://www.ncbi.nlm.nih.gov/pubmed/37292742
http://dx.doi.org/10.1101/2023.05.16.541036
work_keys_str_mv AT dangmeihua tractableandexpressivegenerativemodelsofgeneticvariationdata
AT liuanji tractableandexpressivegenerativemodelsofgeneticvariationdata
AT weixinzhu tractableandexpressivegenerativemodelsofgeneticvariationdata
AT sankararamansriram tractableandexpressivegenerativemodelsofgeneticvariationdata
AT vandenbroeckguy tractableandexpressivegenerativemodelsofgeneticvariationdata