Fast and memory-efficient scRNA-seq k-means clustering with various distances
Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means c...
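As a rough illustration of the k-means++ center finding mentioned above, the sketch below seeds centers by sampling each new center with probability proportional to its squared Euclidean distance from the nearest center chosen so far. The weighted draw uses exponent-style keys in the spirit of weighted reservoir sampling (Efraimidis-Spirakis). This is a plain NumPy sketch under those assumptions, not minicore's vectorized implementation; the function names `weighted_pick` and `kmeanspp_centers` are hypothetical.

```python
import numpy as np


def weighted_pick(weights, rng):
    """Return one index, chosen with probability proportional to `weights`.

    Uses the key log(u_i) / w_i and takes the argmax, which is the
    single-item case of Efraimidis-Spirakis weighted reservoir sampling.
    """
    weights = np.asarray(weights, dtype=float)
    u = 1.0 - rng.random(weights.shape[0])  # uniform on (0, 1], avoids log(0)
    keys = np.full(weights.shape[0], -np.inf)  # zero-weight items never win
    positive = weights > 0
    keys[positive] = np.log(u[positive]) / weights[positive]
    return int(np.argmax(keys))


def kmeanspp_centers(X, k, seed=0):
    """Return the row indices of k seed centers chosen by D^2 sampling."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [int(rng.integers(n))]  # first center: uniform at random
    # Squared distance from every point to its nearest chosen center.
    d2 = np.sum((X - X[centers[0]]) ** 2, axis=1)
    for _ in range(1, k):
        nxt = weighted_pick(d2, rng)
        centers.append(nxt)
        d2 = np.minimum(d2, np.sum((X - X[nxt]) ** 2, axis=1))
    return np.array(centers)
```

Because the keys can be computed independently for every point, this selection step is easy to vectorize or parallelize, which is presumably part of why a reservoir-style formulation is attractive for very large datasets.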
Main authors: | Baker, Daniel N., Dyjack, Nathan, Braverman, Vladimir, Hicks, Stephanie C., Langmead, Ben |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8586878/ https://www.ncbi.nlm.nih.gov/pubmed/34778889 http://dx.doi.org/10.1145/3459930.3469523 |
_version_ | 1784597974229188608 |
---|---|
author | Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben |
author_facet | Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben |
author_sort | Baker, Daniel N. |
collection | PubMed |
description | Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore’s novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels. |
format | Online Article Text |
id | pubmed-8586878 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
record_format | MEDLINE/PubMed |
spelling | pubmed-8586878 2021-11-12 Fast and memory-efficient scRNA-seq k-means clustering with various distances Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben ACM BCB Article Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data from after dimensionality reduction. Minicore’s novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels. 
2021-08 /pmc/articles/PMC8586878/ /pubmed/34778889 http://dx.doi.org/10.1145/3459930.3469523 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution International 4.0 License (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Baker, Daniel N. Dyjack, Nathan Braverman, Vladimir Hicks, Stephanie C. Langmead, Ben Fast and memory-efficient scRNA-seq k-means clustering with various distances |
title | Fast and memory-efficient scRNA-seq k-means clustering with various distances |
title_full | Fast and memory-efficient scRNA-seq k-means clustering with various distances |
title_fullStr | Fast and memory-efficient scRNA-seq k-means clustering with various distances |
title_full_unstemmed | Fast and memory-efficient scRNA-seq k-means clustering with various distances |
title_short | Fast and memory-efficient scRNA-seq k-means clustering with various distances |
title_sort | fast and memory-efficient scrna-seq k-means clustering with various distances |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8586878/ https://www.ncbi.nlm.nih.gov/pubmed/34778889 http://dx.doi.org/10.1145/3459930.3469523 |
work_keys_str_mv | AT bakerdanieln fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT dyjacknathan fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT bravermanvladimir fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT hicksstephaniec fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances AT langmeadben fastandmemoryefficientscrnaseqkmeansclusteringwithvariousdistances |
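
The abstract in this record names three non-Euclidean measures (Kullback-Leibler divergence, Jensen-Shannon divergence, and the Bhattacharyya distance) and notes that priors need careful handling. The sketch below shows the textbook definitions applied to count vectors. The additive pseudocount used as a prior, its default value of 1.0, and all function names are assumptions made for illustration; the record does not say how minicore itself handles priors.

```python
import numpy as np


def normalize(counts, prior=1.0):
    """Convert a count vector to a probability vector.

    `prior` is an assumed additive pseudocount; it keeps every entry
    strictly positive so the logarithms below are always defined.
    """
    counts = np.asarray(counts, dtype=float) + prior
    return counts / counts.sum()


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), in nats."""
    return float(np.sum(p * np.log(p / q)))


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance: -log of the Bhattacharyya coefficient."""
    return float(-np.log(np.sum(np.sqrt(p * q))))


# Two toy "cells" with counts over four genes (made-up numbers).
cell_a = normalize([5, 0, 2, 1])
cell_b = normalize([0, 4, 2, 2])
print(kl_divergence(cell_a, cell_b))
print(jensen_shannon_divergence(cell_a, cell_b))
print(bhattacharyya_distance(cell_a, cell_b))
```

Without a positive prior, a zero count in one cell would make the Kullback-Leibler term undefined whenever the other cell has a nonzero count for the same gene, which is one reason smoothing matters for sparse count data.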