
The Poisson distribution model fits UMI-based single-cell RNA-sequencing data

Background: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggre...


Bibliographic Details
Main Authors: Pan, Yue; Landis, Justin T.; Moorad, Razia; Wu, Di; Marron, J.S.; Dittmer, Dirk P.
Format: Online Article Text
Language: English
Published: American Journal Experts 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9934739/
https://www.ncbi.nlm.nih.gov/pubmed/36798423
http://dx.doi.org/10.21203/rs.3.rs-2517698/v1
collection PubMed
description Background: Modeling of single-cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy because aggregation at those two levels is too crude. Results: We avoid the crude approximations entailed by such aggregation by proposing an Independent Poisson Distribution (IPD) at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed, or can only be found by careful parameter tuning, using conventional methods. Conclusions: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters and (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.
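The abstract's claim that zeros arise naturally from entries with a very small Poisson parameter follows from P(X = 0) = exp(-lambda) for X ~ Poisson(lambda). The following is a minimal sketch of that point only, not the authors' DIPD pipeline or the scpoisson package: the function names, the sampler, and the value lambda = 0.05 are illustrative assumptions that do not come from the record.

```python
import math
import random


def poisson_zero_probability(lam: float) -> float:
    """Return P(X = 0) for X ~ Poisson(lam), which equals exp(-lam)."""
    return math.exp(-lam)


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for the small rates used here; not tuned for large lam.
    """
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_zero_fraction(lam: float, n_draws: int, seed: int = 0) -> float:
    """Fraction of zeros among n_draws independent Poisson(lam) entries."""
    rng = random.Random(seed)
    zeros = sum(1 for _ in range(n_draws) if sample_poisson(lam, rng) == 0)
    return zeros / n_draws


if __name__ == "__main__":
    lam = 0.05  # hypothetical small per-entry rate, chosen only for illustration
    print("theoretical zero probability:", poisson_zero_probability(lam))
    print("simulated zero fraction:", simulate_zero_fraction(lam, 20000))
```

With a rate this small, about 95% of simulated entries are zero without any explicit zero-inflation term, which is the intuition the abstract describes.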
format Online
Article
Text
id pubmed-9934739
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Journal Experts
record_format MEDLINE/PubMed
spelling pubmed-9934739 2023-02-17 The Poisson distribution model fits UMI-based single-cell RNA-sequencing data Pan, Yue; Landis, Justin T.; Moorad, Razia; Wu, Di; Marron, J.S.; Dittmer, Dirk P. Res Sq Article
American Journal Experts 2023-02-06 /pmc/articles/PMC9934739/ /pubmed/36798423 http://dx.doi.org/10.21203/rs.3.rs-2517698/v1 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.