
Optimized cell type signatures revealed from single-cell data by combining principal feature analysis, mutual information, and machine learning

Machine learning techniques are well suited to analyzing expression data from single cells. They affect all areas, from cell annotation and clustering to signature identification. The presented framework evaluates how well gene selection sets separate defined phenotypes...


Bibliographic Details
Main authors:	Caliskan, Aylin; Caliskan, Deniz; Rasbach, Lauritz; Yu, Weimeng; Dandekar, Thomas; Breitenbach, Tim
Format:	Online Article Text
Language:	English
Published:	Research Network of Computational and Structural Biotechnology 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276237/ https://www.ncbi.nlm.nih.gov/pubmed/37333862 http://dx.doi.org/10.1016/j.csbj.2023.06.002

author	Caliskan, Aylin; Caliskan, Deniz; Rasbach, Lauritz; Yu, Weimeng; Dandekar, Thomas; Breitenbach, Tim
collection	PubMed
description	Machine learning techniques are well suited to analyzing expression data from single cells. They affect all areas, from cell annotation and clustering to signature identification. The presented framework evaluates how well gene selection sets separate defined phenotypes or cell groups. This innovation overcomes the current difficulty of objectively and correctly identifying a small gene set with high information content for separating phenotypes; corresponding code scripts are provided. The small but meaningful subset of the original genes (or feature space) makes the differences between phenotypes, including those found by machine learning, easier for humans to interpret and may even turn correlations between genes and phenotypes into a causal explanation. For feature selection, principal feature analysis (PFA) is used, which reduces redundant information while selecting genes that carry the information for separating the phenotypes. In this context, the framework demonstrates explainability of unsupervised learning, as it reveals cell-type-specific signatures. Apart from a Seurat preprocessing tool and the PFA script, the pipeline optionally uses mutual information to balance accuracy against the size of the gene set. A validation part is also provided to evaluate the information content of the gene selection regarding the separation of the phenotypes; binary and multiclass classification of 3 or 4 groups are studied. Results from different single-cell data sets are presented. In each, only about ten out of more than 30,000 genes are identified as carrying the relevant information. The code is provided in a GitHub repository at https://github.com/AC-PHD/Seurat_PFA_pipeline.
format	Online Article Text
id	pubmed-10276237
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Research Network of Computational and Structural Biotechnology
record_format	MEDLINE/PubMed
spelling	pubmed-10276237; 2023-06-18; Comput Struct Biotechnol J; Research Article; Research Network of Computational and Structural Biotechnology; 2023-06-05; /pmc/articles/PMC10276237/; /pubmed/37333862; http://dx.doi.org/10.1016/j.csbj.2023.06.002; Text en; © 2023 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title	Optimized cell type signatures revealed from single-cell data by combining principal feature analysis, mutual information, and machine learning
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10276237/ https://www.ncbi.nlm.nih.gov/pubmed/37333862 http://dx.doi.org/10.1016/j.csbj.2023.06.002
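
The description above mentions using mutual information to weigh the size of a gene set against how well it separates phenotypes. As a rough illustration of that idea only, the following is a minimal sketch in plain Python that ranks genes by their mutual information with a phenotype label. It is not the authors' pipeline and does not implement principal feature analysis; the function names (`mutual_information`, `binarize`, `rank_genes_by_mi`), the median-based discretization, and the synthetic data are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def binarize(values):
    """Discretize expression values: 1 if above the median, else 0 (an assumption)."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    return [1 if v > median else 0 for v in values]


def rank_genes_by_mi(expression, labels):
    """Return (gene, score) pairs sorted by decreasing mutual information with labels.

    expression maps a gene name to its list of per-cell values.
    """
    scores = {
        gene: mutual_information(binarize(values), labels)
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic toy data (hypothetical): 200 cells in two phenotypes and 50 genes,
# of which only gene_0 and gene_1 depend on the phenotype.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
expression = {}
for g in range(50):
    shift = 3.0 if g < 2 else 0.0
    expression[f"gene_{g}"] = [rng.gauss(shift * lab, 1.0) for lab in labels]

ranking = rank_genes_by_mi(expression, labels)
top_genes = [gene for gene, _ in ranking[:2]]
print(top_genes)
```

In a setting like the one described, such a ranking could be truncated at different lengths and each candidate gene set checked with a classifier, which is one plausible way to trade accuracy against set size; the actual procedure is defined by the scripts in the repository named in the description.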
