A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data

The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and thi...

Bibliographic Details
Main Authors:	Cao, Xiaowen; Xing, Li; Majd, Elham; He, Hua; Gu, Junhua; Zhang, Xuekui
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2022
Subjects:	Genetics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8905542/ https://www.ncbi.nlm.nih.gov/pubmed/35281805 http://dx.doi.org/10.3389/fgene.2022.836798

author	Cao, Xiaowen; Xing, Li; Majd, Elham; He, Hua; Gu, Junhua; Zhang, Xuekui
author_sort	Cao, Xiaowen
collection	PubMed
description	The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and this has enabled researchers to train supervised machine learning models for predicting or classifying various cell-level phenotypes. This has led to the development of many new methods for analyzing scRNA-seq data. Despite the popularity of such applications, there has as yet been no systematic investigation of the performance of these supervised algorithms using predictors from various sizes of scRNA-seq datasets. In this study, 13 popular supervised machine learning algorithms for cell phenotype classification were evaluated using published real and simulated datasets with diverse cell sizes. This benchmark comprises two parts. In the first, real datasets were used to assess the computing speed and cell phenotype classification performance of popular supervised algorithms. The classification performances were evaluated using the area under the receiver operating characteristic curve, F1-score, Precision, Recall, and false-positive rate. In the second part, we evaluated gene-selection performance using published simulated datasets with a known list of real genes. The results showed that ElasticNet with interactions performed the best for small and medium-sized datasets. The NaiveBayes classifier was found to be another appropriate method for medium-sized datasets. With large datasets, the performance of the XGBoost algorithm was found to be excellent. Ensemble algorithms were not found to be significantly superior to individual machine learning methods. Including interactions in the ElasticNet algorithm caused a significant performance improvement for small datasets. The linear discriminant analysis algorithm was found to be the best choice when speed is critical; it is the fastest method, it can scale to handle large sample sizes, and its performance is not much worse than the top performers.
format	Online Article Text
id	pubmed-8905542
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-89055422022-03-10 A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data Cao, Xiaowen Xing, Li Majd, Elham He, Hua Gu, Junhua Zhang, Xuekui Front Genet Genetics The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and this has enabled researchers to train supervised machine learning models for predicting or classifying various cell-level phenotypes. This has led to the development of many new methods for analyzing scRNA-seq data. Despite the popularity of such applications, there has as yet been no systematic investigation of the performance of these supervised algorithms using predictors from various sizes of scRNA-seq datasets. In this study, 13 popular supervised machine learning algorithms for cell phenotype classification were evaluated using published real and simulated datasets with diverse cell sizes. This benchmark comprises two parts. In the first, real datasets were used to assess the computing speed and cell phenotype classification performance of popular supervised algorithms. The classification performances were evaluated using the area under the receiver operating characteristic curve, F1-score, Precision, Recall, and false-positive rate. In the second part, we evaluated gene-selection performance using published simulated datasets with a known list of real genes. The results showed that ElasticNet with interactions performed the best for small and medium-sized datasets. The NaiveBayes classifier was found to be another appropriate method for medium-sized datasets. With large datasets, the performance of the XGBoost algorithm was found to be excellent. 
Ensemble algorithms were not found to be significantly superior to individual machine learning methods. Including interactions in the ElasticNet algorithm caused a significant performance improvement for small datasets. The linear discriminant analysis algorithm was found to be the best choice when speed is critical; it is the fastest method, it can scale to handle large sample sizes, and its performance is not much worse than the top performers. Frontiers Media S.A. 2022-02-23 /pmc/articles/PMC8905542/ /pubmed/35281805 http://dx.doi.org/10.3389/fgene.2022.836798 Text en Copyright © 2022 Cao, Xing, Majd, He, Gu and Zhang. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data
title_sort	systematic evaluation of supervised machine learning algorithms for cell phenotype classification using single-cell rna sequencing data
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8905542/ https://www.ncbi.nlm.nih.gov/pubmed/35281805 http://dx.doi.org/10.3389/fgene.2022.836798
