Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data

A fundamental task in single-cell RNA-seq (scRNA-seq) analysis is the identification of transcriptionally distinct groups of cells. Numerous methods have been proposed for this problem, with a recent focus on methods for the cluster analysis of ultralarge scRNA-seq data sets produced by droplet-base...

Bibliographic Details
Main authors: Do, Van Hoan, Rojas Ringeling, Francisca, Canzar, Stefan
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8015854/
https://www.ncbi.nlm.nih.gov/pubmed/33627473
http://dx.doi.org/10.1101/gr.267906.120
author Do, Van Hoan
Rojas Ringeling, Francisca
Canzar, Stefan
collection PubMed
description A fundamental task in single-cell RNA-seq (scRNA-seq) analysis is the identification of transcriptionally distinct groups of cells. Numerous methods have been proposed for this problem, with a recent focus on methods for the cluster analysis of ultralarge scRNA-seq data sets produced by droplet-based sequencing technologies. Most existing methods rely on a sampling step to bridge the gap between algorithm scalability and volume of the data. Ignoring large parts of the data, however, often yields inaccurate groupings of cells and risks overlooking rare cell types. We propose Specter, a method that adopts and extends recent algorithmic advances in (fast) spectral clustering. In contrast to methods that cluster a (random) subsample of the data, we adopt the idea of landmarks that are used to create a sparse representation of the full data from which a spectral embedding can then be computed in linear time. We exploit Specter's speed in a cluster ensemble scheme that achieves a substantial improvement in accuracy over existing methods and identifies rare cell types with high sensitivity. Its linear-time complexity allows Specter to scale to millions of cells and leads to fast computation times in practice. Furthermore, on CITE-seq data that simultaneously measures gene and protein marker expression, we show that Specter is able to use multimodal omics measurements to resolve subtle transcriptomic differences between subpopulations of cells.
format Online
Article
Text
id pubmed-8015854
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-8015854 2021-10-01 Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data Do, Van Hoan Rojas Ringeling, Francisca Canzar, Stefan Genome Res Method
Cold Spring Harbor Laboratory Press 2021-04 /pmc/articles/PMC8015854/ /pubmed/33627473 http://dx.doi.org/10.1101/gr.267906.120 Text en © 2021 Do et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
title Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data
topic Method
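
The landmark idea described in the abstract can be illustrated with a small sketch. This is not the Specter implementation: it follows a generic landmark-based spectral clustering recipe, and the function names, the random choice of landmarks, the Gaussian bandwidth, and the simple k-means helper are all assumptions made for illustration. Each cell is described only by its similarity to a few nearby landmarks, so for a fixed number of landmarks the cost of building the representation and computing the embedding grows linearly with the number of cells. The ensemble step mentioned in the abstract is not shown.

```python
import numpy as np


def landmark_spectral_embedding(X, n_landmarks=20, n_neighbors=3,
                                n_components=2, seed=0):
    """Embed the rows of X using a landmark-based representation (illustrative)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Assumption: landmarks are a random subset of the data points.
    landmarks = X[rng.choice(n, size=n_landmarks, replace=False)]

    # Squared distance from every point to every landmark, shape (n, p).
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)

    # Keep only the r nearest landmarks per point (the sparse part).
    nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
    rows = np.arange(n)[:, None]
    d2_near = d2[rows, nearest]

    # Gaussian weights with a data-driven bandwidth; each row sums to 1.
    sigma2 = d2_near.mean() + 1e-12
    w = np.exp(-d2_near / (2.0 * sigma2))
    w /= w.sum(axis=1, keepdims=True)

    # Dense storage for brevity; a sparse matrix would be used at scale.
    Z = np.zeros((n, n_landmarks))
    Z[rows, nearest] = w

    # Scale each landmark column by the inverse square root of its degree.
    deg = Z.sum(axis=0)
    Z_hat = Z / np.sqrt(deg + 1e-12)

    # Thin SVD of the n x p matrix; left singular vectors give the embedding.
    U, _, _ = np.linalg.svd(Z_hat, full_matrices=False)
    emb = U[:, :n_components]

    # Row-normalise, as is common before k-means in spectral clustering.
    return emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

A hypothetical usage would be `labels = kmeans(landmark_spectral_embedding(X), k)`, where `X` is a cells-by-features matrix (for example, after dimensionality reduction) and `k` is the assumed number of clusters.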