Synthetic single cell RNA sequencing data from small pilot studies using deep generative models
Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell...
Main Authors: | Treppner, Martin; Salas-Bastos, Adrián; Hess, Moritz; Lenz, Stefan; Vogel, Tanja; Binder, Harald |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2021 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087667/ https://www.ncbi.nlm.nih.gov/pubmed/33931726 http://dx.doi.org/10.1038/s41598-021-88875-4 |
_version_ | 1783686704413016064 |
---|---|
author | Treppner, Martin; Salas-Bastos, Adrián; Hess, Moritz; Lenz, Stefan; Vogel, Tanja; Binder, Harald |
author_sort | Treppner, Martin |
collection | PubMed |
description | Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could be used for planning a full-scale experiment by investigating planned analysis strategies on synthetic data with different sample sizes. It is unclear whether synthetic observations generated based on a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples from the posterior distribution, the standard approach, or the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBMs). When considering the similarity of clustering results on synthetic data to ground-truth clustering, we find that the scVI posterior variant resulted in high variability, most likely due to amplifying artifacts of small datasets. All approaches showed mixed results for cell types with different abundance by overrepresenting highly abundant cell types and missing less abundant cell types. With increasing pilot dataset sizes, the proportions of the cells in each cluster became more similar to that of ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, but problems occurred with bimodality. Across all analyses, in comparing 10x Genomics and Smart-seq2 technologies, we could show that for 10x datasets, which have higher sparsity, it is more challenging to make inference from small to larger datasets. Overall, the results show that generative deep learning approaches might be valuable for supporting the design of scRNA-seq experiments. |
format | Online Article Text |
id | pubmed-8087667 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8087667 2021-05-03 Synthetic single cell RNA sequencing data from small pilot studies using deep generative models Treppner, Martin; Salas-Bastos, Adrián; Hess, Moritz; Lenz, Stefan; Vogel, Tanja; Binder, Harald Sci Rep Article Deep generative models, such as variational autoencoders (VAEs) or deep Boltzmann machines (DBMs), can generate an arbitrary number of synthetic observations after being trained on an initial set of samples. This has mainly been investigated for imaging data but could also be useful for single-cell transcriptomics (scRNA-seq). A small pilot study could be used for planning a full-scale experiment by investigating planned analysis strategies on synthetic data with different sample sizes. It is unclear whether synthetic observations generated based on a small scRNA-seq dataset reflect the properties relevant for subsequent data analysis steps. We specifically investigated two deep generative modeling approaches, VAEs and DBMs. First, we considered single-cell variational inference (scVI) in two variants, generating samples from the posterior distribution, the standard approach, or the prior distribution. Second, we propose single-cell deep Boltzmann machines (scDBMs). When considering the similarity of clustering results on synthetic data to ground-truth clustering, we find that the scVI posterior variant resulted in high variability, most likely due to amplifying artifacts of small datasets. All approaches showed mixed results for cell types with different abundance by overrepresenting highly abundant cell types and missing less abundant cell types. With increasing pilot dataset sizes, the proportions of the cells in each cluster became more similar to that of ground-truth data. We also showed that all approaches learn the univariate distribution of most genes, but problems occurred with bimodality. Across all analyses, in comparing 10x Genomics and Smart-seq2 technologies, we could show that for 10x datasets, which have higher sparsity, it is more challenging to make inference from small to larger datasets. Overall, the results show that generative deep learning approaches might be valuable for supporting the design of scRNA-seq experiments. Nature Publishing Group UK 2021-04-30 /pmc/articles/PMC8087667/ /pubmed/33931726 http://dx.doi.org/10.1038/s41598-021-88875-4 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. |
title | Synthetic single cell RNA sequencing data from small pilot studies using deep generative models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8087667/ https://www.ncbi.nlm.nih.gov/pubmed/33931726 http://dx.doi.org/10.1038/s41598-021-88875-4 |