Provable Boolean interaction recovery from tree ensemble obtained via random forests

Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have show...

Bibliographic Details
Main authors: Behr, Merle; Wang, Yu; Li, Xiao; Yu, Bin
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2022
Subjects: Physical Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9295780/
https://www.ncbi.nlm.nih.gov/pubmed/35609192
http://dx.doi.org/10.1073/pnas.2118636119
_version_ 1784750124739592192
author Behr, Merle
Wang, Yu
Li, Xiao
Yu, Bin
collection PubMed
description Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features [Formula: see text]. Intuitively speaking, DWP([Formula: see text]) measures how frequently features in [Formula: see text] appear together in an RF tree ensemble. We prove that, with high probability, DWP([Formula: see text]) attains a universal upper bound that does not involve any model coefficients, if and only if [Formula: see text] corresponds to a union of Boolean interactions under the LSS model. Consequentially, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model, even when some assumptions are violated.
format Online
Article
Text
id pubmed-9295780
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-9295780 2022-07-20 Provable Boolean interaction recovery from tree ensemble obtained via random forests Behr, Merle Wang, Yu Li, Xiao Yu, Bin Proc Natl Acad Sci U S A Physical Sciences
National Academy of Sciences 2022-05-24 2022-05-31 /pmc/articles/PMC9295780/ /pubmed/35609192 http://dx.doi.org/10.1073/pnas.2118636119 Text en Copyright © 2022 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY) (https://creativecommons.org/licenses/by/4.0/).
title Provable Boolean interaction recovery from tree ensemble obtained via random forests
topic Physical Sciences