
Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data

Motivation: Principal component analysis (PCA) is a basic tool often used in bioinformatics for visualization and dimension reduction. However, it is known that PCA may not consistently estimate the true direction of maximal variability in high-dimensional, low sample size settings, which are typica...


Bibliographic Details
Main Authors: Sill, Martin; Saadati, Maral; Benner, Axel
Format: Online Article Text
Language: English
Published: Oxford University Press 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4528629/
https://www.ncbi.nlm.nih.gov/pubmed/25861969
http://dx.doi.org/10.1093/bioinformatics/btv197
author Sill, Martin
Saadati, Maral
Benner, Axel
collection PubMed
description Motivation: Principal component analysis (PCA) is a basic tool often used in bioinformatics for visualization and dimension reduction. However, it is known that PCA may not consistently estimate the true direction of maximal variability in high-dimensional, low sample size settings, which are typical for molecular data. Assuming that the underlying signal is sparse, i.e. that only a fraction of features contribute to a principal component (PC), this estimation consistency can be retained. Most existing sparse PCA methods use L1-penalization, i.e. the lasso, to perform feature selection. However, the lasso is known to lack variable selection consistency in high dimensions, and therefore a subsequent interpretation of selected features can give misleading results. Results: We present S4VDPCA, a sparse PCA method that incorporates a subsampling approach, namely stability selection. S4VDPCA can consistently select the truly relevant variables contributing to a sparse PC while also consistently estimating the direction of maximal variability. The performance of S4VDPCA is assessed in a simulation study and compared to other PCA approaches, as well as to a hypothetical oracle PCA that ‘knows’ the truly relevant features in advance and thus finds optimal, unbiased sparse PCs. S4VDPCA is computationally efficient and performs best in simulations regarding parameter estimation consistency and feature selection consistency. Furthermore, S4VDPCA is applied to a publicly available gene expression data set of medulloblastoma brain tumors. Features contributing to the first two estimated sparse PCs represent genes significantly over-represented in pathways typically deregulated between molecular subgroups of medulloblastoma. Availability and implementation: Software is available at https://github.com/mwsill/s4vdpca. Contact: m.sill@dkfz.de Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-4528629
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4528629 2015-09-24 Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data Sill, Martin Saadati, Maral Benner, Axel Bioinformatics Original Papers Oxford University Press 2015-08-15 2015-04-10 /pmc/articles/PMC4528629/ /pubmed/25861969 http://dx.doi.org/10.1093/bioinformatics/btv197 Text en © The Author 2015. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data
topic Original Papers