Fast and memory-efficient scRNA-seq k-means clustering with various distances

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means c...
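
As a rough illustration of the k-means++ center finding mentioned in the summary, the Python sketch below chooses each new center with probability proportional to its current cost to the nearest chosen center. The draw uses a one-pass weighted reservoir pick (the Efraimidis-Spirakis key trick), since the record's description credits a weighted reservoir sampling algorithm. This is a generic, unvectorized textbook version and not minicore's implementation; the function names `sq_euclidean`, `weighted_reservoir_pick`, and `kmeanspp_seed` are hypothetical, and squared Euclidean distance is assumed as the cost.

```python
import math
import random


def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_reservoir_pick(weights, rng):
    """Return one index drawn with probability proportional to its weight.

    Single pass over the weights: each positive weight w gets the key
    log(u) / w with u uniform in (0, 1], and the largest key wins.
    Returns None when no weight is positive.
    """
    best_idx, best_key = None, -math.inf
    for i, w in enumerate(weights):
        if w <= 0:
            continue
        key = math.log(1.0 - rng.random()) / w  # 1 - random() lies in (0, 1]
        if key > best_key:
            best_idx, best_key = i, key
    return best_idx


def kmeanspp_seed(points, k, dist=sq_euclidean, seed=0):
    """Return indices of k initial centers chosen by k-means++ style sampling."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    centers = [rng.randrange(len(points))]
    # Cost of every point to its nearest chosen center so far.
    costs = [dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        idx = weighted_reservoir_pick(costs, rng)
        if idx is None:  # every point coincides with a chosen center
            idx = rng.randrange(len(points))
        centers.append(idx)
        costs = [min(c, dist(p, points[idx])) for c, p in zip(costs, points)]
    return centers
```

Calling `kmeanspp_seed(points, k)` on a list of numeric tuples returns `k` row indices. Any other cost function with the same two-argument shape can be passed as `dist`, which is where a non-Euclidean measure would plug in.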

Bibliographic Details
Main Authors:	Baker, Daniel N., Dyjack, Nathan, Braverman, Vladimir, Hicks, Stephanie C., Langmead, Ben
Format:	Online Article Text
Language:	English
Published:	2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8586878/ https://www.ncbi.nlm.nih.gov/pubmed/34778889 http://dx.doi.org/10.1145/3459930.3469523

_version_	1784597974229188608
author	Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben
author_facet	Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben
author_sort	Baker, Daniel N.
collection	PubMed
description	Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data produced by dimensionality reduction. Minicore’s novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
format	Online Article Text
id	pubmed-8586878
institution	National Center for Biotechnology Information
language	English
publishDate	2021
record_format	MEDLINE/PubMed
spelling	pubmed-8586878 2021-11-12 Fast and memory-efficient scRNA-seq k-means clustering with various distances Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben ACM BCB Article Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data produced by dimensionality reduction. Minicore’s novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
2021-08 /pmc/articles/PMC8586878/ /pubmed/34778889 http://dx.doi.org/10.1145/3459930.3469523 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution International 4.0 License (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben Fast and memory-efficient scRNA-seq k-means clustering with various distances
title	Fast and memory-efficient scRNA-seq k-means clustering with various distances
title_full	Fast and memory-efficient scRNA-seq k-means clustering with various distances
title_fullStr	Fast and memory-efficient scRNA-seq k-means clustering with various distances
title_full_unstemmed	Fast and memory-efficient scRNA-seq k-means clustering with various distances
title_short	Fast and memory-efficient scRNA-seq k-means clustering with various distances
title_sort	fast and memory-efficient scrna-seq k-means clustering with various distances
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8586878/ https://www.ncbi.nlm.nih.gov/pubmed/34778889 http://dx.doi.org/10.1145/3459930.3469523
work_keys_str_mv	AT bakerdanieln fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT dyjacknathan fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT bravermanvladimir fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT hicksstephaniec fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT langmeadben fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances
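
The description field names three measures that apply directly to count data and probability distributions: Kullback-Leibler Divergence, Jensen-Shannon Divergence, and the Bhattacharyya distance. The Python sketch below gives their standard textbook definitions on normalized count vectors. It is only an illustration and does not reflect how minicore computes them; the helper names are hypothetical, and the `prior` pseudocount is an assumed, simple stand-in for the "careful handling of priors" the abstract refers to.

```python
import math


def normalize(counts, prior=0.0):
    """Turn a count vector into a probability vector.

    `prior` is a hypothetical pseudocount added to every entry so that
    zero counts do not produce infinite divergences.
    """
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p has mass where q has none
            total += pi * math.log(pi / qi)
    return total


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient sum(sqrt(p_i * q_i))."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc) if bc > 0 else math.inf
```

For two cells with raw counts `a` and `b`, `jensen_shannon_divergence(normalize(a), normalize(b))` is always finite, while `kl_divergence` is infinite whenever `a` has a nonzero count where `b` has none unless a positive `prior` is supplied.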

Similar Items