Deep convolutional and conditional neural networks for large-scale genomic data generation

Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and...

Bibliographic Details
Main authors: Yelmen, Burak; Decelle, Aurélien; Boulos, Leila Lea; Szatkownik, Antoine; Furtlehner, Cyril; Charpiat, Guillaume; Jay, Flora
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635570/
https://www.ncbi.nlm.nih.gov/pubmed/37903158
http://dx.doi.org/10.1371/journal.pcbi.1011584
_version_ 1785133025673084928
author Yelmen, Burak
Decelle, Aurélien
Boulos, Leila Lea
Szatkownik, Antoine
Furtlehner, Cyril
Charpiat, Guillaume
Jay, Flora
author_facet Yelmen, Burak
Decelle, Aurélien
Boulos, Leila Lea
Szatkownik, Antoine
Furtlehner, Cyril
Charpiat, Guillaume
Jay, Flora
author_sort Yelmen, Burak
collection PubMed
description Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which can preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.
format Online
Article
Text
id pubmed-10635570
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-106355702023-11-10 Deep convolutional and conditional neural networks for large-scale genomic data generation Yelmen, Burak Decelle, Aurélien Boulos, Leila Lea Szatkownik, Antoine Furtlehner, Cyril Charpiat, Guillaume Jay, Flora PLoS Comput Biol Research Article Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which can preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. 
In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy. Public Library of Science 2023-10-30 /pmc/articles/PMC10635570/ /pubmed/37903158 http://dx.doi.org/10.1371/journal.pcbi.1011584 Text en © 2023 Yelmen et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Yelmen, Burak
Decelle, Aurélien
Boulos, Leila Lea
Szatkownik, Antoine
Furtlehner, Cyril
Charpiat, Guillaume
Jay, Flora
Deep convolutional and conditional neural networks for large-scale genomic data generation
title Deep convolutional and conditional neural networks for large-scale genomic data generation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635570/
https://www.ncbi.nlm.nih.gov/pubmed/37903158
http://dx.doi.org/10.1371/journal.pcbi.1011584