
Fast and memory-efficient scRNA-seq k-means clustering with various distances

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means c...


Bibliographic Details
Main authors: Baker, Daniel N., Dyjack, Nathan, Braverman, Vladimir, Hicks, Stephanie C., Langmead, Ben
Format: Online Article Text
Language: English
Published: 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8586878/
https://www.ncbi.nlm.nih.gov/pubmed/34778889
http://dx.doi.org/10.1145/3459930.3469523
author Baker, Daniel N.
Dyjack, Nathan
Braverman, Vladimir
Hicks, Stephanie C.
Langmead, Ben
collection PubMed
description Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data obtained after dimensionality reduction. Minicore’s novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million-cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures such as Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million-cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.
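The description above names three divergence measures and k-means++ center finding. The following is a minimal NumPy sketch of those ideas, not minicore's API or its vectorized weighted reservoir sampling algorithm. All function names, the pseudocount value `prior=1e-3`, and the choice to sample new centers with probability proportional to the divergence itself (rather than a squared distance) are assumptions made for illustration.

```python
import numpy as np

def normalize(counts, prior=1e-3):
    """Convert a nonnegative count vector to a probability vector.
    The small additive prior (an assumed value) keeps every entry positive."""
    x = np.asarray(counts, dtype=float) + prior
    return x / x.sum()

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive p and q."""
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = 0.5 * (p + q)
    return max(0.0, 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))

def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient sum(sqrt(p * q))."""
    bc = min(1.0, float(np.sum(np.sqrt(p * q))))  # clip rounding above 1
    return float(-np.log(bc))

def kmeanspp_seed(rows, k, dist, rng):
    """k-means++-style seeding with a pluggable dissimilarity.
    The first center is uniform; each later center is drawn with
    probability proportional to its cost to the nearest chosen center."""
    n = len(rows)
    centers = [int(rng.integers(n))]
    cost = np.array([dist(r, rows[centers[0]]) for r in rows])
    while len(centers) < k:
        total = cost.sum()
        # Fall back to uniform sampling if every point already has zero cost.
        probs = cost / total if total > 0 else np.full(n, 1.0 / n)
        nxt = int(rng.choice(n, p=probs))
        centers.append(nxt)
        new_cost = np.array([dist(r, rows[nxt]) for r in rows])
        cost = np.minimum(cost, new_cost)
    return centers

# Usage on synthetic count data (not a real scRNA-seq dataset).
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 8))
rows = np.array([normalize(c) for c in counts])
seeds = kmeanspp_seed(rows, 3, js_divergence, rng)
```

This sketch recomputes all costs in a Python loop for clarity, so it is suitable only for small inputs; it makes no attempt to reproduce the speed or memory behavior reported in the description.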
format Online
Article
Text
id pubmed-8586878
institution National Center for Biotechnology Information
language English
publishDate 2021
record_format MEDLINE/PubMed
spelling pubmed-8586878 2021-11-12 Fast and memory-efficient scRNA-seq k-means clustering with various distances Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben ACM BCB Article
2021-08 /pmc/articles/PMC8586878/ /pubmed/34778889 http://dx.doi.org/10.1145/3459930.3469523 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution International 4.0 License (https://creativecommons.org/licenses/by/4.0/).
title Fast and memory-efficient scRNA-seq k-means clustering with various distances
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8586878/
https://www.ncbi.nlm.nih.gov/pubmed/34778889
http://dx.doi.org/10.1145/3459930.3469523