Hopper: a mathematically optimal algorithm for sketching biological data

MOTIVATION: Single-cell RNA-sequencing has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for tod...

Bibliographic Details
Main Authors:	DeMeo, Benjamin; Berger, Bonnie
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Macromolecular Sequence, Structure, and Function
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355272/ https://www.ncbi.nlm.nih.gov/pubmed/32657375 http://dx.doi.org/10.1093/bioinformatics/btaa408

_version_	1783558241971601408
author	DeMeo, Benjamin Berger, Bonnie
collection	PubMed
description	MOTIVATION: Single-cell RNA-sequencing has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for today’s largest datasets. In addition, current methods often favor common cell types, and miss salient biological features captured by small cell populations. RESULTS: Here we present Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching. Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample. Unlike prior sketching methods, Hopper adds points iteratively and allows for additional sampling from regions of interest, enabling fast and targeted multi-resolution analyses. In a dataset of over 1.3 million mouse brain cells, Hopper detects a cluster of just 64 macrophages expressing inflammatory genes (0.004% of the full dataset) from a Hopper sketch containing just 5000 cells, and several other small but biologically interesting immune cell populations invisible to analysis of the full data. On an even larger dataset consisting of ∼2 million developing mouse organ cells, we show Hopper’s even representation of important cell types in small sketches, in contrast with prior sketching methods. We also introduce Treehopper, which uses spatial partitioning to speed up Hopper by orders of magnitude with minimal loss in performance. By condensing transcriptional information encoded in large datasets, Hopper and Treehopper grant the individual user with a laptop the analytic capabilities of a large consortium. AVAILABILITY AND IMPLEMENTATION: The code for Hopper is available at https://github.com/bendemeo/hopper. In addition, we have provided sketches of many of the largest single-cell datasets, available at http://hopper.csail.mit.edu.
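The description states that Hopper adds points iteratively so that every cell lies close to some cell in the sketch. One standard procedure with this property is farthest-first traversal (the greedy k-center heuristic), which is known to give a 2-approximation of the optimal covering radius. The following is a minimal illustrative sketch under that assumption, not the code from the Hopper repository; the function name `farthest_first_sketch` and its parameters are hypothetical.

```python
import math
import random


def farthest_first_sketch(points, k, seed=0):
    """Pick k points by farthest-first traversal (illustrative, hypothetical API).

    Returns (indices, radius), where radius is the largest distance from any
    input point to its nearest sketch point, i.e. the Hausdorff distance
    between the full point set and the sketch.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    first = rng.randrange(len(points))  # arbitrary starting point
    sketch = [first]
    # min_dist[i] = distance from point i to its nearest sketch point so far
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(sketch) < k:
        # Add the point that is currently worst represented by the sketch.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        sketch.append(nxt)
        # Update nearest-sketch distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return sketch, max(min_dist)
```

Because each step selects the point farthest from the current sketch, a small, distant group of points is reached early even when it is a tiny fraction of the data, which is consistent with the rare-population behavior the description reports. This naive version costs O(nk) distance evaluations; the description attributes the speed-up from spatial partitioning to Treehopper, which is not reproduced here.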
format	Online Article Text
id	pubmed-7355272
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7355272 2020-07-16 Hopper: a mathematically optimal algorithm for sketching biological data DeMeo, Benjamin Berger, Bonnie Bioinformatics Macromolecular Sequence, Structure, and Function Oxford University Press 2020-07 2020-07-13 /pmc/articles/PMC7355272/ /pubmed/32657375 http://dx.doi.org/10.1093/bioinformatics/btaa408 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Hopper: a mathematically optimal algorithm for sketching biological data
topic	Macromolecular Sequence, Structure, and Function
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355272/ https://www.ncbi.nlm.nih.gov/pubmed/32657375 http://dx.doi.org/10.1093/bioinformatics/btaa408
