The Poisson distribution model fits UMI-based single-cell RNA-sequencing data
Background: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggre...
Main Authors: | Pan, Yue; Landis, Justin T.; Moorad, Razia; Wu, Di; Marron, J.S.; Dittmer, Dirk P. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Journal Experts 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9934739/ https://www.ncbi.nlm.nih.gov/pubmed/36798423 http://dx.doi.org/10.21203/rs.3.rs-2517698/v1 |
_version_ | 1784889937467801600 |
---|---|
author | Pan, Yue Landis, Justin T. Moorad, Razia Wu, Di Marron, J.S. Dittmer, Dirk P. |
author_facet | Pan, Yue Landis, Justin T. Moorad, Razia Wu, Di Marron, J.S. Dittmer, Dirk P. |
author_sort | Pan, Yue |
collection | PubMed |
description | Background: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. Results: We avoid the crude approximations entailed by such aggregation through proposing an Independent Poisson Distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. Conclusions: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson. |
format | Online Article Text |
id | pubmed-9934739 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | American Journal Experts |
record_format | MEDLINE/PubMed |
spelling | pubmed-9934739 2023-02-17 The Poisson distribution model fits UMI-based single-cell RNA-sequencing data Pan, Yue Landis, Justin T. Moorad, Razia Wu, Di Marron, J.S. Dittmer, Dirk P. Res Sq Article
American Journal Experts 2023-02-06 /pmc/articles/PMC9934739/ /pubmed/36798423 http://dx.doi.org/10.21203/rs.3.rs-2517698/v1 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
title | The Poisson distribution model fits UMI-based single-cell RNA-sequencing data |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9934739/ https://www.ncbi.nlm.nih.gov/pubmed/36798423 http://dx.doi.org/10.21203/rs.3.rs-2517698/v1 |