Statistical significance of variables driving systematic variation in high-dimensional data

Motivation: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biolog...

Bibliographic Details
Main Authors:	Chung, Neo Christopher, Storey, John D.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2015
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325543/ https://www.ncbi.nlm.nih.gov/pubmed/25336500 http://dx.doi.org/10.1093/bioinformatics/btu674

author	Chung, Neo Christopher Storey, John D.
collection	PubMed
description	Motivation: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. Results: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. Availability and implementation: An R software package, called jackstraw, is available in CRAN. Contact: jstorey@princeton.edu
format	Online Article Text
id	pubmed-4325543
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-4325543 2015-03-02 Statistical significance of variables driving systematic variation in high-dimensional data Chung, Neo Christopher Storey, John D. Bioinformatics Original Papers Oxford University Press 2015-02-15 2014-10-21 /pmc/articles/PMC4325543/ /pubmed/25336500 http://dx.doi.org/10.1093/bioinformatics/btu674 Text en © The Author 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Statistical significance of variables driving systematic variation in high-dimensional data
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4325543/ https://www.ncbi.nlm.nih.gov/pubmed/25336500 http://dx.doi.org/10.1093/bioinformatics/btu674
