GAN-based data augmentation for transcriptomics: survey and comparative assessment

MOTIVATION: Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting deep learning models’ full predictive power for phenotype prediction. Artificially enhancing the training sets, namely data augmentatio...


Bibliographic Details
Main Authors:	Lacan, Alice; Sebag, Michèle; Hanczar, Blaise
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Biomedical Informatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311334/ https://www.ncbi.nlm.nih.gov/pubmed/37387181 http://dx.doi.org/10.1093/bioinformatics/btad239

_version_	1785066721488404480
author	Lacan, Alice; Sebag, Michèle; Hanczar, Blaise
author_facet	Lacan, Alice; Sebag, Michèle; Hanczar, Blaise
author_sort	Lacan, Alice
collection	PubMed
description	MOTIVATION: Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting deep learning models’ full predictive power for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes. RESULTS: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields an accuracy of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved 98% and 94% accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly. AVAILABILITY AND IMPLEMENTATION: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics
format	Online Article Text
id	pubmed-10311334
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10311334 2023-07-01 GAN-based data augmentation for transcriptomics: survey and comparative assessment Lacan, Alice; Sebag, Michèle; Hanczar, Blaise Bioinformatics Biomedical Informatics Oxford University Press 2023-06-30 /pmc/articles/PMC10311334/ /pubmed/37387181 http://dx.doi.org/10.1093/bioinformatics/btad239 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Biomedical Informatics Lacan, Alice Sebag, Michèle Hanczar, Blaise GAN-based data augmentation for transcriptomics: survey and comparative assessment
title	GAN-based data augmentation for transcriptomics: survey and comparative assessment
title_full	GAN-based data augmentation for transcriptomics: survey and comparative assessment
title_fullStr	GAN-based data augmentation for transcriptomics: survey and comparative assessment
title_full_unstemmed	GAN-based data augmentation for transcriptomics: survey and comparative assessment
title_short	GAN-based data augmentation for transcriptomics: survey and comparative assessment
title_sort	gan-based data augmentation for transcriptomics: survey and comparative assessment
topic	Biomedical Informatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311334/ https://www.ncbi.nlm.nih.gov/pubmed/37387181 http://dx.doi.org/10.1093/bioinformatics/btad239
work_keys_str_mv	AT lacanalice ganbaseddataaugmentationfortranscriptomicssurveyandcomparativeassessment AT sebagmichele ganbaseddataaugmentationfortranscriptomicssurveyandcomparativeassessment AT hanczarblaise ganbaseddataaugmentationfortranscriptomicssurveyandcomparativeassessment
