
Data-driven human transcriptomic modules determined by independent component analysis

BACKGROUND: Analyzing the human transcriptome is crucial in advancing precision medicine, and the plethora of over half a million human microarray samples in the Gene Expression Omnibus (GEO) has enabled us to better characterize biological processes at the molecular level. However, transcriptomic a...


Bibliographic Details
Main Authors:	Zhou, Weizhuang; Altman, Russ B.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6142401/ https://www.ncbi.nlm.nih.gov/pubmed/30223787 http://dx.doi.org/10.1186/s12859-018-2338-4

_version_	1783355852765265920
author	Zhou, Weizhuang Altman, Russ B.
author_facet	Zhou, Weizhuang Altman, Russ B.
author_sort	Zhou, Weizhuang
collection	PubMed
description	BACKGROUND: Analyzing the human transcriptome is crucial in advancing precision medicine, and the plethora of over half a million human microarray samples in the Gene Expression Omnibus (GEO) has enabled us to better characterize biological processes at the molecular level. However, transcriptomic analysis is challenging because the data is inherently noisy and high-dimensional. Gene set analysis is currently widely used to alleviate the issue of high dimensionality, but the user-defined choice of gene sets can introduce bias into the results. In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO in order to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies, and demonstrate (1) their usage as features for sample classification, and also their robustness in dealing with small training sets, (2) their regularization of data when clustering samples and (3) the biological relevance of differentially expressed features. RESULTS: We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p < 0.01). The models also had higher accuracy and negative predictive value (p < 0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data is clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches.
CONCLUSIONS: The 139 FCs provide biologically relevant summarization of transcriptomic data, and their performance in low-sample settings suggests that they should be employed in such studies in order to harness the data efficiently. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2338-4) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6142401
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6142401 2018-09-20 Data-driven human transcriptomic modules determined by independent component analysis Zhou, Weizhuang Altman, Russ B. BMC Bioinformatics Research Article BioMed Central 2018-09-17 /pmc/articles/PMC6142401/ /pubmed/30223787 http://dx.doi.org/10.1186/s12859-018-2338-4 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Zhou, Weizhuang Altman, Russ B. Data-driven human transcriptomic modules determined by independent component analysis
title	Data-driven human transcriptomic modules determined by independent component analysis
title_sort	data-driven human transcriptomic modules determined by independent component analysis
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6142401/ https://www.ncbi.nlm.nih.gov/pubmed/30223787 http://dx.doi.org/10.1186/s12859-018-2338-4
work_keys_str_mv	AT zhouweizhuang datadrivenhumantranscriptomicmodulesdeterminedbyindependentcomponentanalysis AT altmanrussb datadrivenhumantranscriptomicmodulesdeterminedbyindependentcomponentanalysis
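
The description field in this record proposes using a fixed set of transcriptomic modules as features in place of individual genes. As a rough illustration of that idea only, the sketch below turns one sample's gene-level values into one score per module by taking a weighted sum. The gene names, weights, module names, and the plain weighted-sum projection are all hypothetical assumptions made for this example; the record does not state how the authors compute FC-space values, and real module weights would come from a decomposition such as independent component analysis.

```python
# Toy sketch: summarize a gene-level expression profile as a few module scores.
# Every gene name, weight, and expression value here is made up for illustration.

def project_onto_modules(expression, modules):
    """Return one score per module: the weighted sum of the sample's gene values.

    expression: dict mapping gene symbol -> expression value for one sample
    modules:    dict mapping module name -> dict of gene symbol -> weight
    Genes missing from the sample contribute 0 to the score.
    """
    scores = {}
    for name, weights in modules.items():
        scores[name] = sum(w * expression.get(gene, 0.0) for gene, w in weights.items())
    return scores


# Hypothetical modules; in practice the weights would be learned, not hand-written.
MODULES = {
    "module_A": {"GENE1": 0.5, "GENE2": 0.5},
    "module_B": {"GENE3": 1.0, "GENE4": -1.0},
}

# One hypothetical sample measured on four genes.
sample = {"GENE1": 2.0, "GENE2": 4.0, "GENE3": 1.0, "GENE4": 3.0}

features = project_onto_modules(sample, MODULES)
print(features)
```

The four gene values are reduced to two module scores, which is the sense in which a fixed module set lowers dimensionality before classification or clustering.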
