Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data
A fundamental task in single-cell RNA-seq (scRNA-seq) analysis is the identification of transcriptionally distinct groups of cells. Numerous methods have been proposed for this problem, with a recent focus on methods for the cluster analysis of ultralarge scRNA-seq data sets produced by droplet-base...
Main Authors: | Do, Van Hoan; Rojas Ringeling, Francisca; Canzar, Stefan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press 2021 |
Subjects: | Method |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8015854/ https://www.ncbi.nlm.nih.gov/pubmed/33627473 http://dx.doi.org/10.1101/gr.267906.120 |
_version_ | 1783673759992905728 |
---|---|
author | Do, Van Hoan; Rojas Ringeling, Francisca; Canzar, Stefan |
author_facet | Do, Van Hoan; Rojas Ringeling, Francisca; Canzar, Stefan |
author_sort | Do, Van Hoan |
collection | PubMed |
description | A fundamental task in single-cell RNA-seq (scRNA-seq) analysis is the identification of transcriptionally distinct groups of cells. Numerous methods have been proposed for this problem, with a recent focus on methods for the cluster analysis of ultralarge scRNA-seq data sets produced by droplet-based sequencing technologies. Most existing methods rely on a sampling step to bridge the gap between algorithm scalability and volume of the data. Ignoring large parts of the data, however, often yields inaccurate groupings of cells and risks overlooking rare cell types. We propose Specter, a method that adopts and extends recent algorithmic advances in (fast) spectral clustering. In contrast to methods that cluster a (random) subsample of the data, we adopt the idea of landmarks that are used to create a sparse representation of the full data from which a spectral embedding can then be computed in linear time. We exploit Specter's speed in a cluster ensemble scheme that achieves a substantial improvement in accuracy over existing methods and identifies rare cell types with high sensitivity. Its linear-time complexity allows Specter to scale to millions of cells and leads to fast computation times in practice. Furthermore, on CITE-seq data that simultaneously measure gene and protein marker expression, we show that Specter is able to use multimodal omics measurements to resolve subtle transcriptomic differences between subpopulations of cells. |
format | Online Article Text |
id | pubmed-8015854 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8015854 2021-10-01 Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data Do, Van Hoan Rojas Ringeling, Francisca Canzar, Stefan Genome Res Method
Cold Spring Harbor Laboratory Press 2021-04 /pmc/articles/PMC8015854/ /pubmed/33627473 http://dx.doi.org/10.1101/gr.267906.120 Text en © 2021 Do et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. |
spellingShingle | Method Do, Van Hoan Rojas Ringeling, Francisca Canzar, Stefan Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data |
title | Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data |
title_full | Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data |
title_fullStr | Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data |
title_full_unstemmed | Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data |
title_short | Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data |
title_sort | linear-time cluster ensembles of large-scale single-cell rna-seq and multimodal data |
topic | Method |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8015854/ https://www.ncbi.nlm.nih.gov/pubmed/33627473 http://dx.doi.org/10.1101/gr.267906.120 |
work_keys_str_mv | AT dovanhoan lineartimeclusterensemblesoflargescalesinglecellrnaseqandmultimodaldata AT rojasringelingfrancisca lineartimeclusterensemblesoflargescalesinglecellrnaseqandmultimodaldata AT canzarstefan lineartimeclusterensemblesoflargescalesinglecellrnaseqandmultimodaldata |
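
The description in this record mentions landmarks that yield a sparse representation of the full data set. The following is a minimal sketch of that general idea only, not the Specter implementation: the function name `landmark_representation`, its parameters, the random choice of landmarks, and the Gaussian kernel are all assumptions made for illustration. Each data point is described by normalized kernel weights to its few nearest landmarks, so the work grows linearly with the number of points. The subsequent spectral embedding step is omitted.

```python
import math
import random


def landmark_representation(points, n_landmarks, n_nearest, bandwidth=1.0, seed=0):
    """Describe every point by weights to its nearest landmarks.

    Returns (landmarks, rows), where rows[i] maps a landmark index to the
    weight of point i on that landmark. All names and defaults here are
    hypothetical; they are not taken from the article.
    """
    rng = random.Random(seed)
    # Assumption: landmarks are drawn uniformly at random from the data.
    landmarks = rng.sample(points, n_landmarks)
    rows = []
    for x in points:
        # Squared Euclidean distance from x to every landmark.
        dists = [sum((a - b) ** 2 for a, b in zip(x, u)) for u in landmarks]
        # Keep only the n_nearest closest landmarks (this makes the row sparse).
        nearest = sorted(range(n_landmarks), key=dists.__getitem__)[:n_nearest]
        d_min = dists[nearest[0]]
        # Gaussian kernel weights; subtracting d_min avoids underflow and
        # cancels out once the row is normalized.
        weights = {
            j: math.exp(-(dists[j] - d_min) / (2 * bandwidth ** 2)) for j in nearest
        }
        total = sum(weights.values())
        rows.append({j: w / total for j, w in weights.items()})
    return landmarks, rows


# Toy data: two well-separated groups of points on a line.
points = [(0.1 * i, 0.0) for i in range(10)] + [(10 + 0.1 * i, 0.0) for i in range(10)]
landmarks, rows = landmark_representation(points, n_landmarks=6, n_nearest=2)
```

With n points, p landmarks, and r nearest landmarks per point, this loop costs O(n · p) distance evaluations and stores n · r nonzero weights, which is the kind of representation from which a spectral embedding could then be computed without forming an n × n similarity matrix.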