The Poisson distribution model fits UMI-based single-cell RNA-sequencing data
BACKGROUND: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggre...
Main authors: | Pan, Yue; Landis, Justin T.; Moorad, Razia; Wu, Di; Marron, J. S.; Dittmer, Dirk P. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2023 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276395/ https://www.ncbi.nlm.nih.gov/pubmed/37330471 http://dx.doi.org/10.1186/s12859-023-05349-2 |
_version_ | 1785060067790290944 |
---|---|
author | Pan, Yue Landis, Justin T. Moorad, Razia Wu, Di Marron, J. S. Dittmer, Dirk P. |
author_facet | Pan, Yue Landis, Justin T. Moorad, Razia Wu, Di Marron, J. S. Dittmer, Dirk P. |
author_sort | Pan, Yue |
collection | PubMed |
description | BACKGROUND: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. RESULTS: We avoid the crude approximations entailed by such aggregation through proposing an independent Poisson distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. CONCLUSIONS: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05349-2. |
format | Online Article Text |
id | pubmed-10276395 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-10276395 2023-06-18 The Poisson distribution model fits UMI-based single-cell RNA-sequencing data Pan, Yue Landis, Justin T. Moorad, Razia Wu, Di Marron, J. S. Dittmer, Dirk P. BMC Bioinformatics Research BACKGROUND: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. RESULTS: We avoid the crude approximations entailed by such aggregation through proposing an independent Poisson distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. CONCLUSIONS: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05349-2. BioMed Central 2023-06-17 /pmc/articles/PMC10276395/ /pubmed/37330471 http://dx.doi.org/10.1186/s12859-023-05349-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Pan, Yue Landis, Justin T. Moorad, Razia Wu, Di Marron, J. S. Dittmer, Dirk P. The Poisson distribution model fits UMI-based single-cell RNA-sequencing data |
title | The Poisson distribution model fits UMI-based single-cell RNA-sequencing data |
title_sort | poisson distribution model fits umi-based single-cell rna-sequencing data |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276395/ https://www.ncbi.nlm.nih.gov/pubmed/37330471 http://dx.doi.org/10.1186/s12859-023-05349-2 |
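
The abstract in this record describes the many zeros in scRNA-seq data as matrix entries with a very small Poisson parameter. For a Poisson variable with rate λ, the probability of observing zero is e^(−λ), so small rates alone can produce mostly-zero data without a separate zero-inflation term. The following is a minimal sketch of that arithmetic only, not the scpoisson implementation; the function names and the example rates are hypothetical and chosen for illustration.

```python
import math


def zero_probability(rate):
    """Return P(X = 0) for X ~ Poisson(rate), which equals exp(-rate)."""
    if rate < 0:
        raise ValueError("Poisson rate must be non-negative")
    return math.exp(-rate)


def expected_zero_fraction(rates):
    """Average P(X_ij = 0) over a matrix of independent per-entry Poisson rates."""
    probs = [zero_probability(r) for row in rates for r in row]
    return sum(probs) / len(probs)


# Hypothetical per-entry rates (rows: genes, columns: cells); not taken from the paper.
example_rates = [
    [0.01, 0.02, 0.05],
    [0.10, 0.05, 0.01],
]

fraction = expected_zero_fraction(example_rates)
print(f"expected fraction of zeros: {fraction:.3f}")
```

With rates of this size, nearly every entry is expected to be zero, which is the intuition the abstract appeals to when it treats zeros as ordinary Poisson outcomes.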