The Poisson distribution model fits UMI-based single-cell RNA-sequencing data

Bibliographic Details
Main Authors: Pan, Yue; Landis, Justin T.; Moorad, Razia; Wu, Di; Marron, J. S.; Dittmer, Dirk P.
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276395/
https://www.ncbi.nlm.nih.gov/pubmed/37330471
http://dx.doi.org/10.1186/s12859-023-05349-2
author Pan, Yue
Landis, Justin T.
Moorad, Razia
Wu, Di
Marron, J. S.
Dittmer, Dirk P.
collection PubMed
description BACKGROUND: Modeling single-cell RNA-sequencing (scRNA-seq) data remains challenging because of the high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream analyses. Existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy because aggregation at those two levels is too crude. RESULTS: We avoid the crude approximations entailed by such aggregation by proposing an independent Poisson distribution (IPD) at each individual entry of the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation, Departures from a simple homogeneous IPD (DIPD), which captures the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments on real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that conventional methods miss or find only with careful parameter tuning. CONCLUSIONS: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters and (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05349-2.
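
The description above makes two points that a small numerical sketch may help illustrate: a Poisson entry with a very small parameter is zero with high probability (P(X = 0) = exp(-lambda)), and heterogeneity can be viewed as a departure from a simple homogeneous model. The Python sketch below is illustrative only and is not the scpoisson implementation. The function names, the toy count matrix, the rank-one rate estimate built from row and column totals, and the Pearson-style residual used as a "departure" are all assumptions introduced here; the article's actual DIPD definition may differ.

```python
import math


def poisson_zero_prob(lam: float) -> float:
    """P(X = 0) for X ~ Poisson(lam), which equals exp(-lam)."""
    return math.exp(-lam)


def homogeneous_rates(counts):
    """Assumed homogeneous model: rate[g][c] = row_sum[g] * col_sum[c] / total.

    This rank-one estimate is a stand-in for a 'simple homogeneous'
    independent Poisson model, not the estimator used in the article.
    """
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


def departures(counts, rates):
    """Pearson-style residual (x - lam) / sqrt(lam) per entry.

    Hypothetical departure measure chosen for illustration only.
    """
    return [
        [(x - lam) / math.sqrt(lam) if lam > 0 else 0.0
         for x, lam in zip(row_x, row_l)]
        for row_x, row_l in zip(counts, rates)
    ]


# Toy matrix (made up): 3 genes (rows) x 4 cells (columns).
# Genes 0 and 1 are expressed in different pairs of cells; gene 2 is flat.
counts = [
    [0, 0, 5, 6],
    [3, 4, 0, 0],
    [1, 1, 1, 1],
]

rates = homogeneous_rates(counts)
resid = departures(counts, rates)

# A small Poisson parameter makes a zero count very likely.
print(poisson_zero_prob(0.05))

# Entries with large positive or negative residuals hint at cluster structure.
for row in resid:
    print([round(v, 2) for v in row])
```

In this toy example the first two genes show residuals of opposite sign across the two groups of cells, while the flat gene stays close to zero, which is the kind of per-gene-per-cell signal a departure-based representation is meant to expose.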
format Online
Article
Text
id pubmed-10276395
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10276395 2023-06-18 BMC Bioinformatics Research BioMed Central 2023-06-17 /pmc/articles/PMC10276395/ /pubmed/37330471 http://dx.doi.org/10.1186/s12859-023-05349-2 Text en
© The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title The Poisson distribution model fits UMI-based single-cell RNA-sequencing data
topic Research