Super-sparse principal component analyses for high-throughput genomic data

BACKGROUND: Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are t...

Bibliographic Details
Main Authors: Lee, Donghwan, Lee, Woojoo, Lee, Youngjo, Pawitan, Yudi
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2902448/
https://www.ncbi.nlm.nih.gov/pubmed/20525176
http://dx.doi.org/10.1186/1471-2105-11-296
author Lee, Donghwan
Lee, Woojoo
Lee, Youngjo
Pawitan, Yudi
collection PubMed
description BACKGROUND: Principal component analysis (PCA) has gained popularity as a method for the analysis of high-dimensional genomic data. However, it is often difficult to interpret the results because the principal components are linear combinations of all variables, and the coefficients (loadings) are typically nonzero. These nonzero values also reflect poor estimation of the true vector loadings; for example, for gene expression data, biologically we expect only a portion of the genes to be expressed in any tissue, and an even smaller fraction to be involved in a particular process. Sparse PCA methods have recently been introduced for reducing the number of nonzero coefficients, but these existing methods are not satisfactory for high-dimensional data applications because they still give too many nonzero coefficients. RESULTS: Here we propose a new PCA method that uses two innovations to produce an extremely sparse loading vector: (i) a random-effect model on the loadings that leads to an unbounded penalty at the origin and (ii) shrinkage of the singular values obtained from the singular value decomposition of the data matrix. We develop a stable computing algorithm by modifying the nonlinear iterative partial least squares (NIPALS) algorithm, and illustrate the method with an analysis of the NCI cancer dataset that contains 21,225 genes. CONCLUSIONS: The new method has better performance than several existing methods, particularly in the estimation of the loading vectors.
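The general idea of a sparse first component computed by a NIPALS-style iteration can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: it substitutes a plain soft-threshold (lasso-type) penalty for the unbounded random-effect penalty described in the abstract, and it omits the singular-value shrinkage step. The function names `soft_threshold` and `sparse_nipals_pc`, the parameter `lam`, and the toy matrix are all hypothetical.

```python
import math


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; anything within [-lam, lam] becomes 0."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def sparse_nipals_pc(X, lam, n_iter=500, tol=1e-12):
    """First sparse loading vector and scores for X (list of rows).

    Assumption: a soft-threshold penalty stands in for the paper's penalty.
    """
    n, p = len(X), len(X[0])
    # Center each column so that components describe variation around the mean.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    # NIPALS start: use the column with the largest sum of squares as scores.
    j0 = max(range(p), key=lambda j: sum(r[j] ** 2 for r in Xc))
    t = [r[j0] for r in Xc]
    v = [0.0] * p
    for _ in range(n_iter):
        tt = sum(ti * ti for ti in t)
        if tt == 0.0:
            raise ValueError("data have no variation")
        # Regress every column on the current scores to get raw loadings.
        w = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        # Sparsity step: small loadings are set exactly to zero.
        w = [soft_threshold(wj, lam) for wj in w]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm == 0.0:
            raise ValueError("lam too large: all loadings shrunk to zero")
        v_new = [wj / norm for wj in w]
        # Update the scores from the normalized sparse loadings.
        t = [sum(Xc[i][j] * v_new[j] for j in range(p)) for i in range(n)]
        delta = max(abs(a - b) for a, b in zip(v_new, v))
        v = v_new
        if delta < tol:
            break
    return v, t


# Toy data: columns 0 and 1 move together, column 2 is almost pure noise.
X_toy = [[2.0, 1.9, 0.1], [-2.0, -2.1, -0.1], [1.0, 1.1, 0.0], [-1.0, -0.9, 0.0]]
loadings, scores = sparse_nipals_pc(X_toy, lam=0.1)
print(loadings)
```

On the toy matrix the third variable contributes almost nothing to the leading direction, so the threshold sets its loading to exactly zero while the first two keep comparable weights. With 21,225 genes one would use a matrix library rather than nested lists; plain Python is used here only to keep the sketch dependency-free.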
format Text
id pubmed-2902448
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2902448 2010-07-13 Super-sparse principal component analyses for high-throughput genomic data Lee, Donghwan Lee, Woojoo Lee, Youngjo Pawitan, Yudi BMC Bioinformatics Research article BioMed Central 2010-06-02 /pmc/articles/PMC2902448/ /pubmed/20525176 http://dx.doi.org/10.1186/1471-2105-11-296 Text en Copyright ©2010 Lee et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Super-sparse principal component analyses for high-throughput genomic data
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2902448/
https://www.ncbi.nlm.nih.gov/pubmed/20525176
http://dx.doi.org/10.1186/1471-2105-11-296