
Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data

BACKGROUND: A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification. Various scRNA-seq data clustering algorithms have been specifically designed to auto...


Bibliographic Details
Main authors:	Yu, Lijia; Cao, Yue; Yang, Jean Y. H.; Yang, Pengyi
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8822786/ https://www.ncbi.nlm.nih.gov/pubmed/35135612 http://dx.doi.org/10.1186/s13059-022-02622-0

author	Yu, Lijia; Cao, Yue; Yang, Jean Y. H.; Yang, Pengyi
collection	PubMed
description	BACKGROUND: A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type identification. Various scRNA-seq data clustering algorithms have been specifically designed to automatically estimate the number of cell types through optimising the number of clusters in a dataset. The lack of benchmark studies, however, complicates the choice of the methods. RESULTS: We systematically benchmark a range of popular clustering algorithms on estimating the number of cell types in a variety of settings by sampling from the Tabula Muris data to create scRNA-seq datasets with a varying number of cell types, varying number of cells in each cell type, and different cell type proportions. The large number of datasets enables us to assess the performance of the algorithms, covering four broad categories of approaches, from various aspects using a panel of criteria. We further cross-compared the performance on datasets with high cell numbers using Tabula Muris and Tabula Sapiens data. CONCLUSIONS: We identify the strengths and weaknesses of each method on multiple criteria including the deviation of estimation from the true number of cell types, variability of estimation, clustering concordance of cells to their predefined cell types, and running time and peak memory usage. We then summarise these results into a multi-aspect recommendation to the users. The proposed stability-based approach for estimating the number of cell types is implemented in an R package and is freely available from (https://github.com/PYangLab/scCCESS). SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-022-02622-0.
format	Online Article Text
id	pubmed-8822786
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8822786 2022-02-08 Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data Yu, Lijia; Cao, Yue; Yang, Jean Y. H.; Yang, Pengyi Genome Biol Research
BioMed Central 2022-02-08 /pmc/articles/PMC8822786/ /pubmed/35135612 http://dx.doi.org/10.1186/s13059-022-02622-0 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Benchmarking clustering algorithms on estimating the number of cell types from single-cell RNA-sequencing data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8822786/ https://www.ncbi.nlm.nih.gov/pubmed/35135612 http://dx.doi.org/10.1186/s13059-022-02622-0
