
Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

BACKGROUND: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statisti...


Bibliographic Details
Main authors: Lause, Jan; Berens, Philipp; Kobak, Dmitry
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8419999/
https://www.ncbi.nlm.nih.gov/pubmed/34488842
http://dx.doi.org/10.1186/s13059-021-02451-7
_version_ 1783748871114981376
author Lause, Jan
Berens, Philipp
Kobak, Dmitry
author_facet Lause, Jan
Berens, Philipp
Kobak, Dmitry
author_sort Lause, Jan
collection PubMed
description BACKGROUND: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing. RESULTS: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth. CONCLUSIONS: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction. 
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s13059-021-02451-7).
format Online
Article
Text
id pubmed-8419999
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-84199992021-09-09 Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data Lause, Jan Berens, Philipp Kobak, Dmitry Genome Biol Short Report BACKGROUND: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing. RESULTS: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth. 
CONCLUSIONS: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (10.1186/s13059-021-02451-7). BioMed Central 2021-09-06 /pmc/articles/PMC8419999/ /pubmed/34488842 http://dx.doi.org/10.1186/s13059-021-02451-7 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Short Report
Lause, Jan
Berens, Philipp
Kobak, Dmitry
Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data
title Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data
title_full Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data
title_fullStr Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data
title_full_unstemmed Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data
title_short Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data
title_sort analytic pearson residuals for normalization of single-cell rna-seq umi data
topic Short Report
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8419999/
https://www.ncbi.nlm.nih.gov/pubmed/34488842
http://dx.doi.org/10.1186/s13059-021-02451-7
work_keys_str_mv AT lausejan analyticpearsonresidualsfornormalizationofsinglecellrnasequmidata
AT berensphilipp analyticpearsonresidualsfornormalizationofsinglecellrnasequmidata
AT kobakdmitry analyticpearsonresidualsfornormalizationofsinglecellrnasequmidata
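
Illustrative note on the abstract: the record's description says the parsimonious model "has a simple analytic solution" but does not give it. The sketch below shows one common reading of analytic Pearson residuals, assuming a cells-by-genes UMI count matrix, an expected count equal to (cell total × gene total) / grand total, and a negative binomial variance mu + mu²/theta. The function name, the default theta = 100, and the clipping at ±sqrt(number of cells) are assumptions of this sketch and are not stated in this record; consult the article itself for the authors' exact formulation.

```python
import numpy as np


def analytic_pearson_residuals(counts, theta=100.0, clip=None):
    """Pearson residuals under a rank-one count model (illustrative sketch).

    counts : array of shape (n_cells, n_genes) with raw UMI counts.
    theta  : assumed shared overdispersion parameter (hypothetical default).
    clip   : residuals are clipped to [-clip, clip]; defaults to sqrt(n_cells).
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]

    cell_totals = counts.sum(axis=1, keepdims=True)  # sequencing depth per cell
    gene_totals = counts.sum(axis=0, keepdims=True)  # total count per gene
    grand_total = counts.sum()

    # Expected count if each cell sampled genes in the same proportions.
    mu = cell_totals * gene_totals / grand_total

    # Negative binomial variance; approaches Poisson as theta grows.
    variance = mu + mu**2 / theta

    # Genes or cells with zero total have zero variance; leave their residuals at 0.
    residuals = np.zeros_like(counts)
    np.divide(counts - mu, np.sqrt(variance), out=residuals, where=variance > 0)

    if clip is None:
        clip = np.sqrt(n_cells)
    return np.clip(residuals, -clip, clip)
```

Under these assumptions, a count matrix that is exactly rank one (every cell a scaled copy of the same expression profile) yields residuals of zero everywhere, so only deviations from the depth-scaled profile remain for downstream steps such as variable-gene selection or PCA.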