Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Identifying differentially expressed genes is difficult because of the small number of available samples compared with the large number of genes. Conventional gene selection methods employing statistical tests have the critical problem of heavy dependence of P-values on sample size. Although the rec...

Bibliographic Details
Main Authors:	Taguchi, Y-h., Turki, Turki
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2022
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9521941/ https://www.ncbi.nlm.nih.gov/pubmed/36173994 http://dx.doi.org/10.1371/journal.pone.0275472

author	Taguchi, Y-h. Turki, Turki
collection	PubMed
description	Identifying differentially expressed genes is difficult because the number of available samples is small compared with the number of genes. Conventional gene selection methods that employ statistical tests suffer from the critical problem that P-values depend heavily on sample size. Although the recently proposed principal component analysis (PCA)- and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why it works so well has been unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP), which was proposed long ago to address the problem of dimensionality; we can relate the space spanned by singular value vectors to that spanned by the optimal cluster centroids obtained from K-means. Thus, the success of PCA- and TD-based unsupervised FE can be understood through this equivalence. In addition, a threshold of 0.01 on adjusted P-values computed under the null hypothesis that singular value vectors attributed to genes obey a Gaussian distribution empirically corresponds to a threshold of 0.1 on adjusted P-values when the null distribution is generated by gene order shuffling. For this purpose, we newly applied PP to three data sets to which PCA- and TD-based unsupervised FE had previously been applied; these data sets cover two topics, biomarker identification for kidney cancers (the first two) and drug discovery for COVID-19 (the third one). We found that the results of PP agree well with those of PCA- or TD-based unsupervised FE. The shuffling procedure described above was also successfully applied to these three data sets. These findings thus rationalize the success of PCA- and TD-based unsupervised FE for the first time.
format	Online Article Text
id	pubmed-9521941
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-9521941 2022-09-30 PLoS One Research Article Public Library of Science 2022-09-29 /pmc/articles/PMC9521941/ /pubmed/36173994 http://dx.doi.org/10.1371/journal.pone.0275472 Text en © 2022 Taguchi, Turki https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9521941/ https://www.ncbi.nlm.nih.gov/pubmed/36173994 http://dx.doi.org/10.1371/journal.pone.0275472
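
The gene selection rule summarized in the description above (a Gaussian null for the singular value vector attributed to genes, followed by a 0.01 threshold on adjusted P-values) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a genes × samples matrix, per-sample centering, a one-degree-of-freedom chi-squared test on the standardized component score, and Benjamini-Hochberg adjustment; none of these details is stated in this record, and the names `bh_adjust` and `select_genes` are hypothetical.

```python
import math

import numpy as np


def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest P-value downward.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def select_genes(x, component=0, threshold=0.01):
    """Select rows (genes) of x that are outliers along one singular vector.

    x is a genes x samples matrix. Each sample (column) is centered across
    genes, which is an assumption of this sketch.
    """
    centered = x - x.mean(axis=0, keepdims=True)
    u, _, _ = np.linalg.svd(centered, full_matrices=False)
    score = u[:, component]  # singular value vector attributed to genes
    z_squared = ((score - score.mean()) / score.std()) ** 2
    # Upper tail of a chi-squared distribution with one degree of freedom,
    # i.e. the Gaussian null on the component score.
    p = np.array([math.erfc(math.sqrt(v / 2.0)) for v in z_squared])
    return np.flatnonzero(bh_adjust(p) < threshold)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = rng.normal(size=(1000, 10))  # synthetic data, not from the paper
    toy[:20, :5] += 6.0                # 20 planted genes that differ
    toy[:20, 5:] -= 6.0                # between two groups of 5 samples
    print(len(select_genes(toy)), "genes selected")
```

The sketch uses only NumPy and the standard library; in practice, which component to test would be chosen by inspecting the singular value vectors attributed to samples.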

Similar Items