Dirichlet Diffusion Score Model for Biological Sequence Generation

Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. Score-based generative stochastic differe...

Bibliographic Details
Main authors: Avdeyev, Pavel, Shi, Chenlai, Tan, Yuhao, Dudnyk, Kseniia, Zhou, Jian
Format: Online Article Text
Language: English
Published: Cornell University 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246113/
https://www.ncbi.nlm.nih.gov/pubmed/37292476
author Avdeyev, Pavel
Shi, Chenlai
Tan, Yuhao
Dudnyk, Kseniia
Zhou, Jian
collection PubMed
description Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. The score-based generative stochastic differential equation (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as the Dirichlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.
format Online
Article
Text
id pubmed-10246113
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cornell University
record_format MEDLINE/PubMed
spelling pubmed-10246113 2023-06-08 Dirichlet Diffusion Score Model for Biological Sequence Generation Avdeyev, Pavel Shi, Chenlai Tan, Yuhao Dudnyk, Kseniia Zhou, Jian ArXiv Article Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. The score-based generative stochastic differential equation (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space whose stationary distribution is the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as the Dirichlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences. Cornell University 2023-06-16 /pmc/articles/PMC10246113/ /pubmed/37292476 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Dirichlet Diffusion Score Model for Biological Sequence Generation
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10246113/
https://www.ncbi.nlm.nih.gov/pubmed/37292476