A systematic performance evaluation of clustering methods for single-cell RNA-seq data

Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we pro...

Bibliographic Details
Main authors: Duò, Angelo, Robinson, Mark D., Soneson, Charlotte
Format: Online Article Text
Language: English
Published: F1000 Research Limited 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6134335/
https://www.ncbi.nlm.nih.gov/pubmed/30271584
http://dx.doi.org/10.12688/f1000research.15666.3
author Duò, Angelo
Robinson, Mark D.
Soneson, Charlotte
author_facet Duò, Angelo
Robinson, Mark D.
Soneson, Charlotte
author_sort Duò, Angelo
collection PubMed
description Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability of recovering known subpopulations, the stability and the run time and scalability of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor (https://bioconductor.org/packages/DuoClustering2018).
format Online
Article
Text
id pubmed-6134335
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher F1000 Research Limited
record_format MEDLINE/PubMed
spelling pubmed-6134335 2018-09-27 A systematic performance evaluation of clustering methods for single-cell RNA-seq data Duò, Angelo Robinson, Mark D. Soneson, Charlotte F1000Res Research Article F1000 Research Limited 2020-11-16 /pmc/articles/PMC6134335/ /pubmed/30271584 http://dx.doi.org/10.12688/f1000research.15666.3 Text en Copyright: © 2020 Duò A et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Duò, Angelo
Robinson, Mark D.
Soneson, Charlotte
A systematic performance evaluation of clustering methods for single-cell RNA-seq data
title A systematic performance evaluation of clustering methods for single-cell RNA-seq data
title_full A systematic performance evaluation of clustering methods for single-cell RNA-seq data
title_fullStr A systematic performance evaluation of clustering methods for single-cell RNA-seq data
title_full_unstemmed A systematic performance evaluation of clustering methods for single-cell RNA-seq data
title_short A systematic performance evaluation of clustering methods for single-cell RNA-seq data
title_sort systematic performance evaluation of clustering methods for single-cell rna-seq data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6134335/
https://www.ncbi.nlm.nih.gov/pubmed/30271584
http://dx.doi.org/10.12688/f1000research.15666.3