
Compression of quantification uncertainty for scRNA-seq counts

MOTIVATION: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many g...


Bibliographic Details
Main authors:	Van Buren, Scott; Sarkar, Hirak; Srivastava, Avi; Rashid, Naim U; Patro, Rob; Love, Michael I
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Original Papers
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8289386/ https://www.ncbi.nlm.nih.gov/pubmed/33471073 http://dx.doi.org/10.1093/bioinformatics/btab001

_version_	1783724290402680832
author	Van Buren, Scott Sarkar, Hirak Srivastava, Avi Rashid, Naim U Patro, Rob Love, Michael I
author_facet	Van Buren, Scott Sarkar, Hirak Srivastava, Avi Rashid, Naim U Patro, Rob Love, Michael I
author_sort	Van Buren, Scott
collection	PubMed
description	MOTIVATION: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of ‘inferential replicates’, which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements. RESULTS: We demonstrate that storing only the mean and variance from a set of inferential replicates (‘compression’) is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate ‘pseudo-inferential’ replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset. AVAILABILITY AND IMPLEMENTATION: makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper’s GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-8289386
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-8289386 2021-07-20 Compression of quantification uncertainty for scRNA-seq counts Van Buren, Scott Sarkar, Hirak Srivastava, Avi Rashid, Naim U Patro, Rob Love, Michael I Bioinformatics Original Papers MOTIVATION: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of ‘inferential replicates’, which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements. RESULTS: We demonstrate that storing only the mean and variance from a set of inferential replicates (‘compression’) is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate ‘pseudo-inferential’ replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset. AVAILABILITY AND IMPLEMENTATION: makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper’s GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2021-01-20 /pmc/articles/PMC8289386/ /pubmed/33471073 http://dx.doi.org/10.1093/bioinformatics/btab001 Text en © The Author(s) 2021. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Original Papers Van Buren, Scott Sarkar, Hirak Srivastava, Avi Rashid, Naim U Patro, Rob Love, Michael I Compression of quantification uncertainty for scRNA-seq counts
title	Compression of quantification uncertainty for scRNA-seq counts
title_full	Compression of quantification uncertainty for scRNA-seq counts
title_fullStr	Compression of quantification uncertainty for scRNA-seq counts
title_full_unstemmed	Compression of quantification uncertainty for scRNA-seq counts
title_short	Compression of quantification uncertainty for scRNA-seq counts
title_sort	compression of quantification uncertainty for scrna-seq counts
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8289386/ https://www.ncbi.nlm.nih.gov/pubmed/33471073 http://dx.doi.org/10.1093/bioinformatics/btab001
work_keys_str_mv	AT vanburenscott compressionofquantificationuncertaintyforscrnaseqcounts AT sarkarhirak compressionofquantificationuncertaintyforscrnaseqcounts AT srivastavaavi compressionofquantificationuncertaintyforscrnaseqcounts AT rashidnaimu compressionofquantificationuncertaintyforscrnaseqcounts AT patrorob compressionofquantificationuncertaintyforscrnaseqcounts AT lovemichaeli compressionofquantificationuncertaintyforscrnaseqcounts
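
Illustrative note: the abstract describes storing only the mean and variance of a set of inferential replicates and then drawing 'pseudo-inferential' replicates from a negative binomial distribution. The following is a minimal Python sketch of that idea using moment matching. It is not the fishpond implementation (makeInfReps is an R function); the function names compress and pseudo_inf_reps, the array layout, and the Poisson fallback for entries whose variance does not exceed the mean are assumptions made here for illustration.

```python
import numpy as np


def compress(inf_reps):
    """Reduce inferential replicates to a per-entry mean and variance.

    inf_reps: array of shape (n_reps, n_genes, n_cells) (assumed layout).
    """
    inf_reps = np.asarray(inf_reps, dtype=float)
    return inf_reps.mean(axis=0), inf_reps.var(axis=0, ddof=1)


def pseudo_inf_reps(mean, var, n_reps, rng):
    """Draw pseudo-inferential replicates matching the stored mean and variance.

    Entries with var > mean > 0 are drawn from a negative binomial whose
    first two moments match (mean, var). All other entries fall back to a
    Poisson draw with the stored mean (an assumption of this sketch).
    """
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    over = (var > mean) & (mean > 0)

    # NumPy parameterization: mean = n(1-p)/p, var = n(1-p)/p^2,
    # so p = mean/var and n = mean^2 / (var - mean).
    # Placeholder values keep the non-overdispersed entries valid.
    n = np.where(over, mean**2 / np.where(over, var - mean, 1.0), 1.0)
    p = np.where(over, mean / np.where(over, var, 1.0), 0.5)

    shape = (n_reps,) + mean.shape
    nb_draws = rng.negative_binomial(n, p, size=shape)
    pois_draws = rng.poisson(mean, size=shape)
    return np.where(over, nb_draws, pois_draws)


# Usage: compress 20 toy replicates for 3 genes x 4 cells, then regenerate.
rng = np.random.default_rng(0)
toy_reps = rng.negative_binomial(5, 0.2, size=(20, 3, 4))
m, v = compress(toy_reps)
regenerated = pseudo_inf_reps(m, v, n_reps=20, rng=rng)
print(regenerated.shape)
```

Only the two summary arrays need to be kept on disk in this scheme; the replicates are regenerated when an analysis needs them, which is the source of the storage and memory savings the abstract reports.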


Similar Items