
Machine learning for cell type classification from single nucleus RNA sequencing data

With the advent of single cell/nucleus RNA sequencing (sc/snRNA-seq), the field of cell phenotyping is now a data-driven exercise providing statistical evidence to support cell type/state categorization. However, the task of classifying cells into specific, well-defined categories with the empirical...


Bibliographic Details
Main Authors:	Le, Huy, Peng, Beverly, Uy, Janelle, Carrillo, Daniel, Zhang, Yun, Aevermann, Brian D., Scheuermann, Richard H.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2022
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9506651/ https://www.ncbi.nlm.nih.gov/pubmed/36149937 http://dx.doi.org/10.1371/journal.pone.0275070

author	Le, Huy Peng, Beverly Uy, Janelle Carrillo, Daniel Zhang, Yun Aevermann, Brian D. Scheuermann, Richard H.
collection	PubMed
description	With the advent of single cell/nucleus RNA sequencing (sc/snRNA-seq), the field of cell phenotyping is now a data-driven exercise providing statistical evidence to support cell type/state categorization. However, the task of classifying cells into specific, well-defined categories with the empirical data provided by sc/snRNA-seq remains nontrivial due to the difficulty in determining specific differences between related cell types with close transcriptional similarities, resulting in challenges with matching cell types identified in separate experiments. To investigate possible approaches to overcome these obstacles, we explored the use of supervised machine learning methods, namely logistic regression, support vector machines, random forests, neural networks, and light gradient boosting machine (LightGBM), as approaches to classify cell types using snRNA-seq datasets from human brain middle temporal gyrus (MTG) and human kidney. Classification accuracy was evaluated using an F-beta score weighted in favor of precision to account for technical artifacts of gene expression dropout. We examined the impact of hyperparameter optimization and feature selection methods on F-beta score performance. We found that the best performing model for granular cell type classification in both datasets is a multinomial logistic regression classifier and that an effective feature selection step was the most influential factor in optimizing the performance of the machine learning pipelines.
format	Online Article Text
id	pubmed-9506651
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-9506651 2022-09-24 PLoS One Research Article Public Library of Science 2022-09-23 /pmc/articles/PMC9506651/ /pubmed/36149937 http://dx.doi.org/10.1371/journal.pone.0275070 Text en © 2022 Le et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Machine learning for cell type classification from single nucleus RNA sequencing data
topic	Research Article
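
Note on the evaluation metric named in the description: an F-beta score "weighted in favor of precision" corresponds to choosing beta below 1. The record does not state the beta value the authors used, so the sketch below assumes beta = 0.5 purely for illustration; the function name `f_beta` and the example numbers are hypothetical and are not taken from the article.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily, beta > 1 weights recall more
    heavily, and beta == 1 gives the familiar F1 score.
    The default beta=0.5 is an assumption, not a value from the article.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = beta**2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# Two hypothetical classifiers with the same F1 but opposite trade-offs.
high_precision = f_beta(precision=0.9, recall=0.6)
high_recall = f_beta(precision=0.6, recall=0.9)
print(high_precision > high_recall)
```

With beta below 1, the classifier with higher precision scores better even though both have the same F1, which is the behavior the description attributes to its precision-weighted metric.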
