Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features

Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of featur...

Bibliographic Details
Main authors: Tian, Leqi; Wu, Wenbin; Yu, Tianwei
Format: Online Article Text
Language: English
Published: MDPI 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10377046/
https://www.ncbi.nlm.nih.gov/pubmed/37509188
http://dx.doi.org/10.3390/biom13071153
author Tian, Leqi
Wu, Wenbin
Yu, Tianwei
collection PubMed
description Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of features [Formula: see text] compared to the size of samples [Formula: see text]. Though the predictive accuracy using RF is often high, there are some problems when selecting important genes using RF. The important genes selected by RF are usually scattered on the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF) for identifying highly connected important features by involving the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs and achieve equivalent classification accuracy to RF. To evaluate the capability of our proposed method, we conducted simulation experiments and applied the method to two real datasets—non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas, and human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures.
format Online
Article
Text
id pubmed-10377046
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10377046 2023-07-29 Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features Tian, Leqi Wu, Wenbin Yu, Tianwei Biomolecules Article MDPI 2023-07-20 /pmc/articles/PMC10377046/ /pubmed/37509188 http://dx.doi.org/10.3390/biom13071153 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features
topic Article