K-Nearest-Neighbors Induced Topological PCA for Single Cell RNA-Sequence Data Analysis

Bibliographic Details
Main Authors: Cottrell, Sean; Hozumi, Yuta; Wei, Guo-Wei
Format: Online Article Text
Language: English
Published: Cornell University 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635285/
https://www.ncbi.nlm.nih.gov/pubmed/37961744
author Cottrell, Sean
Hozumi, Yuta
Wei, Guo-Wei
collection PubMed
description Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because of its sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a workhorse of dimensionality reduction, cannot capture the geometrical structure embedded in the data, and previous graph Laplacian regularizations are limited to the analysis of a single scale. We propose a topological Principal Component Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L(2,1) norm regularization to address multiscale and multiclass heterogeneity in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses many limitations of traditional persistent homology. Rather than inducing a filtration by varying a distance threshold, we introduce kNN-tPCA, in which the filtration is obtained by varying the number of neighbors in a kNN network at each step, and we find that this framework has significant implications for hyper-parameter tuning. We validate our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets and show that they outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF), by significant margins. For example, tPCA provides up to 628%, 78%, and 149% improvements over UMAP, tSNE, and NMF, respectively, on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements over UMAP, tSNE, and NMF, respectively, on clustering in the ARI metric.
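The kNN-induced filtration described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `knn_filtration_laplacians` and `accumulated_laplacian`, the symmetrization of the kNN graph, the unweighted edges, and the per-scale weights are all assumptions made for illustration. The sketch grows a graph by adding each point's k-th nearest neighbor at step k, so the graphs are nested, and records the graph Laplacian at every step; a weighted sum of these Laplacians is one plausible way to build a multiscale regularizer.

```python
import numpy as np


def knn_filtration_laplacians(X, max_k):
    """Graph Laplacians of nested, symmetrized kNN graphs for k = 1..max_k.

    Hypothetical helper: rows of X are samples (e.g. cells), columns are features.
    """
    n = X.shape[0]
    if not 1 <= max_k <= n - 1:
        raise ValueError("max_k must be between 1 and n_samples - 1")

    # Pairwise squared Euclidean distances; self-distances are pushed to
    # infinity so a point is never selected as its own neighbor.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)  # nearest neighbors first

    adjacency = np.zeros((n, n))
    rows = np.arange(n)
    laplacians = []
    for k in range(1, max_k + 1):
        # Step k adds the k-th nearest neighbor of every point. Edges are
        # only ever added, so the sequence of graphs forms a filtration.
        cols = order[:, k - 1]
        adjacency[rows, cols] = 1.0
        adjacency[cols, rows] = 1.0  # symmetrize (an assumption)
        degree = np.diag(adjacency.sum(axis=1))
        laplacians.append(degree - adjacency)
    return laplacians


def accumulated_laplacian(laplacians, weights):
    """Weighted sum of the per-scale Laplacians (weights are hypothetical)."""
    return sum(w * L for w, L in zip(weights, laplacians))


# Usage on synthetic data: 8 points in 3 dimensions, filtration up to k = 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 3))
L_scales = knn_filtration_laplacians(X_demo, max_k=3)
L_multi = accumulated_laplacian(L_scales, weights=[1.0, 0.5, 0.25])
```

Because the filtration index is an integer neighbor count rather than a continuous distance threshold, the scales to tune are a small discrete set, which is consistent with the abstract's remark about hyper-parameter tuning.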
format Online
Article
Text
id pubmed-10635285
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cornell University
record_format MEDLINE/PubMed
spelling pubmed-10635285 2023-11-13 K-Nearest-Neighbors Induced Topological PCA for Single Cell RNA-Sequence Data Analysis Cottrell, Sean Hozumi, Yuta Wei, Guo-Wei ArXiv Article Cornell University 2023-10-23 /pmc/articles/PMC10635285/ /pubmed/37961744 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title K-Nearest-Neighbors Induced Topological PCA for Single Cell RNA-Sequence Data Analysis
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635285/
https://www.ncbi.nlm.nih.gov/pubmed/37961744