Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features
Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of featur...
Main Authors: | Tian, Leqi, Wu, Wenbin, Yu, Tianwei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10377046/ https://www.ncbi.nlm.nih.gov/pubmed/37509188 http://dx.doi.org/10.3390/biom13071153 |
_version_ | 1785079419675607040 |
---|---|
author | Tian, Leqi Wu, Wenbin Yu, Tianwei |
author_facet | Tian, Leqi Wu, Wenbin Yu, Tianwei |
author_sort | Tian, Leqi |
collection | PubMed |
description | Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of features (p) compared to the size of samples (n). Though the predictive accuracy using RF is often high, there are some problems when selecting important genes using RF. The important genes selected by RF are usually scattered on the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF) for identifying highly connected important features by involving the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs and achieve equivalent classification accuracy to RF. To evaluate the capability of our proposed method, we conducted simulation experiments and applied the method to two real datasets: non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas, and a human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures. |
format | Online Article Text |
id | pubmed-10377046 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10377046 2023-07-29 Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features Tian, Leqi Wu, Wenbin Yu, Tianwei Biomolecules Article Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of features (p) compared to the size of samples (n). Though the predictive accuracy using RF is often high, there are some problems when selecting important genes using RF. The important genes selected by RF are usually scattered on the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF) for identifying highly connected important features by involving the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs and achieve equivalent classification accuracy to RF. To evaluate the capability of our proposed method, we conducted simulation experiments and applied the method to two real datasets: non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas, and a human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures. MDPI 2023-07-20 /pmc/articles/PMC10377046/ /pubmed/37509188 http://dx.doi.org/10.3390/biom13071153 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Tian, Leqi Wu, Wenbin Yu, Tianwei Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features |
title | Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features |
title_full | Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features |
title_fullStr | Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features |
title_full_unstemmed | Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features |
title_short | Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features |
title_sort | graph random forest: a graph embedded algorithm for identifying highly connected important features |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10377046/ https://www.ncbi.nlm.nih.gov/pubmed/37509188 http://dx.doi.org/10.3390/biom13071153 |
work_keys_str_mv | AT tianleqi graphrandomforestagraphembeddedalgorithmforidentifyinghighlyconnectedimportantfeatures AT wuwenbin graphrandomforestagraphembeddedalgorithmforidentifyinghighlyconnectedimportantfeatures AT yutianwei graphrandomforestagraphembeddedalgorithmforidentifyinghighlyconnectedimportantfeatures |
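
The description field contrasts features that are "scattered on the gene network" with features that "form highly connected sub-graphs". The sketch below illustrates that distinction on a toy graph. It is not the authors' GRF implementation, which this record does not describe: the gene names, the adjacency dictionary, and both helper functions (`neighborhood`, `largest_component_size`) are hypothetical, chosen only to show one way a known network could restrict a candidate feature set and how the connectivity of a selected set might be measured.

```python
from collections import deque


def neighborhood(adj, seed, hops):
    """Return all nodes within `hops` edges of `seed` (breadth-first search)."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def largest_component_size(adj, selected):
    """Size of the largest connected component in the subgraph induced by `selected`."""
    unvisited = set(selected)
    best = 0
    while unvisited:
        stack = [unvisited.pop()]
        size = 1
        while stack:
            node = stack.pop()
            for nxt in adj.get(node, ()):
                if nxt in unvisited:  # only follow edges inside the selected set
                    unvisited.remove(nxt)
                    stack.append(nxt)
                    size += 1
        best = max(best, size)
    return best


# Hypothetical toy network: a chain g1-g2-g3-g4 and a separate pair g5-g6.
adj = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3"],
    "g5": ["g6"],
    "g6": ["g5"],
}

# A graph-guided candidate set: the seed gene plus its direct neighbors.
candidates = neighborhood(adj, "g2", hops=1)

# A connected selection versus a scattered one of the same size.
connected_score = largest_component_size(adj, ["g1", "g2", "g3"])
scattered_score = largest_component_size(adj, ["g1", "g4", "g5"])
```

In this toy setting, a selection drawn from one neighborhood yields a single large component, while a selection of the same size spread across the network yields only isolated nodes. Largest-component size is one of several possible connectivity summaries and is used here purely for illustration.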