Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach

Random forests (RFs) are a widely used modelling tool capable of feature selection via a variable importance measure (VIM); however, a threshold is needed to control for false positives. In the absence of a good understanding of the characteristics of VIMs, many current approaches attempt to select...


Bibliographic Details
Main authors:	Dunne, Robert; Reguant, Roc; Ramarao-Milne, Priya; Szul, Piotr; Sng, Letitia M.F.; Lundberg, Mischa; Twine, Natalie A.; Bauer, Denis C.
Format:	Online Article Text
Language:	English
Published:	Research Network of Computational and Structural Biotechnology 2023
Subjects:	Method Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10497997/ https://www.ncbi.nlm.nih.gov/pubmed/37711185 http://dx.doi.org/10.1016/j.csbj.2023.08.033

author	Dunne, Robert; Reguant, Roc; Ramarao-Milne, Priya; Szul, Piotr; Sng, Letitia M.F.; Lundberg, Mischa; Twine, Natalie A.; Bauer, Denis C.
collection	PubMed
description	Random forests (RFs) are a widely used modelling tool capable of feature selection via a variable importance measure (VIM); however, a threshold is needed to control for false positives. In the absence of a good understanding of the characteristics of VIMs, many current approaches attempt to select features associated with the response by training multiple RFs to generate statistical power via a permutation null, by employing recursive feature elimination, or through a combination of both. However, for high-dimensional datasets these approaches become computationally infeasible. In this paper, we present RFlocalfdr, a statistical approach, built on the empirical Bayes argument of Efron, for thresholding mean decrease in impurity (MDI) importances. It identifies features significantly associated with the response while controlling the false positive rate. Using synthetic data and real-world data in health, we demonstrate that RFlocalfdr has equivalent accuracy to currently published approaches, while being orders of magnitude faster. We show that RFlocalfdr can successfully threshold a dataset of 10^6 datapoints, establishing its usability for large-scale datasets, like genomics. Furthermore, RFlocalfdr is compatible with any RF implementation that returns a VIM and counts, making it a versatile feature selection tool that reduces false discoveries.
format	Online Article Text
id	pubmed-10497997
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Research Network of Computational and Structural Biotechnology
record_format	MEDLINE/PubMed
spelling	pubmed-10497997 2023-09-14 Comput Struct Biotechnol J Method Article Research Network of Computational and Structural Biotechnology 2023-09-01 /pmc/articles/PMC10497997/ /pubmed/37711185 http://dx.doi.org/10.1016/j.csbj.2023.08.033 Text en Crown Copyright © 2023 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title	Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach
topic	Method Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10497997/ https://www.ncbi.nlm.nih.gov/pubmed/37711185 http://dx.doi.org/10.1016/j.csbj.2023.08.033
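
The description above says that importances are thresholded with an empirical Bayes argument due to Efron. As a rough illustration of that general idea only, and not of the RFlocalfdr implementation, the sketch below computes a local false discovery rate fdr(z) = pi0 * f0(z) / f(z) on synthetic scores. Everything specific here is an assumption made for the example: the function names (`local_fdr`, `select_features`), the normal empirical null fitted by median and MAD, the Gaussian kernel density estimate for f, the conservative pi0 = 1, and the illustrative cutoff of 0.2. The record does not describe how the paper models the null.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def local_fdr(scores, pi0=1.0):
    """Estimate fdr(z) = pi0 * f0(z) / f(z) for every score.

    Assumptions (not taken from the paper): a normal empirical null fitted
    by median and MAD, and a Gaussian kernel density estimate for f.
    """
    n = len(scores)
    mu0 = statistics.median(scores)
    mad = statistics.median(abs(s - mu0) for s in scores)
    sigma0 = 1.4826 * mad  # MAD rescaled to match a normal standard deviation
    # Normal-reference rule of thumb for the kernel bandwidth
    bandwidth = 1.06 * statistics.stdev(scores) * n ** -0.2
    fdr = []
    for z in scores:
        f = sum(normal_pdf(z, s, bandwidth) for s in scores) / n  # mixture density
        f0 = normal_pdf(z, mu0, sigma0)                           # null density
        fdr.append(min(1.0, pi0 * f0 / f))
    return fdr, mu0


def select_features(scores, cutoff=0.2):
    """Indices whose score lies above the null centre with local fdr below cutoff."""
    fdr, mu0 = local_fdr(scores)
    return [i for i, (z, q) in enumerate(zip(scores, fdr)) if z > mu0 and q < cutoff]


# Synthetic stand-in for importance scores: 950 null features, 50 signals.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(950)] + [rng.gauss(6.0, 0.5) for _ in range(50)]
selected = select_features(scores)
print(len(selected), "features selected")
```

Only one set of scores is needed, which is the appeal of this family of methods compared with permutation nulls that require refitting the forest many times. With real importances, a transformation of the raw values and a more careful null fit would likely be needed.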
