
Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations

BACKGROUND: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most re...


Bibliographic Details
Main Authors:	Way, Gregory P., Zietz, Michael, Rubinetti, Vincent, Himmelstein, Daniel S., Greene, Casey S.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7212571/ https://www.ncbi.nlm.nih.gov/pubmed/32393369 http://dx.doi.org/10.1186/s13059-020-02021-3

collection	PubMed
description	BACKGROUND: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent by which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. RESULTS: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities. CONCLUSIONS: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
format	Online Article Text
id	pubmed-7212571
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7212571 2020-05-18 Genome Biol Research BioMed Central 2020-05-11 /pmc/articles/PMC7212571/ /pubmed/32393369 http://dx.doi.org/10.1186/s13059-020-02021-3 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations
