Dirichlet process mixture models for single-cell RNA-seq clustering

Clustering of cells based on gene expression is one of the major steps in single-cell RNA-sequencing (scRNA-seq) data analysis. One key challenge in cluster analysis is the unknown number of clusters and, for this issue, there is still no comprehensive solution. To enhance the process of defining me...

Bibliographic Details
Main authors:	Adossa, Nigatu A., Rytkönen, Kalle T., Elo, Laura L.
Format:	Online Article Text
Language:	English
Published:	The Company of Biologists Ltd 2022
Subjects:	Methods and Techniques
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9002799/ https://www.ncbi.nlm.nih.gov/pubmed/35237784 http://dx.doi.org/10.1242/bio.059001

author	Adossa, Nigatu A. Rytkönen, Kalle T. Elo, Laura L.
author_sort	Adossa, Nigatu A.
collection	PubMed
description	Clustering of cells based on gene expression is one of the major steps in single-cell RNA-sequencing (scRNA-seq) data analysis. One key challenge in cluster analysis is the unknown number of clusters and, for this issue, there is still no comprehensive solution. To enhance the process of defining meaningful cluster resolution, we compare Bayesian latent Dirichlet allocation (LDA) method to its non-parametric counterpart, hierarchical Dirichlet process (HDP) in the context of clustering scRNA-seq data. A potential main advantage of HDP is that it does not require the number of clusters as an input parameter from the user. While LDA has been used in single-cell data analysis, it has not been compared in detail with HDP. Here, we compare the cell clustering performance of LDA and HDP using four scRNA-seq datasets (immune cells, kidney, pancreas and decidua/placenta), with a specific focus on cluster numbers. Using both intrinsic (DB-index) and extrinsic (ARI) cluster quality measures, we show that the performance of LDA and HDP is dataset dependent. We describe a case where HDP produced a more appropriate clustering compared to the best performer from a series of LDA clusterings with different numbers of clusters. However, we also observed cases where the best performing LDA cluster numbers appropriately capture the main biological features while HDP tended to inflate the number of clusters. Overall, our study highlights the importance of carefully assessing the number of clusters when analyzing scRNA-seq data.
format	Online Article Text
id	pubmed-9002799
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	The Company of Biologists Ltd
record_format	MEDLINE/PubMed
spelling	pubmed-9002799 2022-04-12 Dirichlet process mixture models for single-cell RNA-seq clustering. Adossa, Nigatu A.; Rytkönen, Kalle T.; Elo, Laura L. Biol Open. Methods and Techniques. The Company of Biologists Ltd 2022-04-04 /pmc/articles/PMC9002799/ /pubmed/35237784 http://dx.doi.org/10.1242/bio.059001 Text en © 2022. Published by The Company of Biologists Ltd. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.
title	Dirichlet process mixture models for single-cell RNA-seq clustering
topic	Methods and Techniques
