
Exploitation of surrogate variables in random forests for unbiased analysis of mutual impact and importance of features

Bibliographic Details
Main authors: Voges, Lucas F; Jarren, Lukas C; Seifert, Stephan
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403431/
https://www.ncbi.nlm.nih.gov/pubmed/37522865
http://dx.doi.org/10.1093/bioinformatics/btad471
author Voges, Lucas F
Jarren, Lukas C
Seifert, Stephan
collection PubMed
description MOTIVATION: Random forest is a popular machine learning approach for the analysis of high-dimensional data because it is flexible and provides variable importance measures for the selection of relevant features. However, the complex relationships between the features are usually not considered for the selection and thus also neglected for the characterization of the analysed samples. RESULTS: Here we propose two novel approaches that focus on the mutual impact of features in random forests. Mutual forest impact (MFI) is a relation parameter that evaluates the mutual association of the features to the outcome and, hence, goes beyond the analysis of correlation coefficients. Mutual impurity reduction (MIR) is an importance measure that combines this relation parameter with the importance of the individual features. MIR and MFI are implemented together with testing procedures that generate P-values for the selection of related and important features. Applications to one experimental and various simulated datasets and the comparison to other methods for feature selection and relation analysis show that MFI and MIR are very promising to shed light on the complex relationships between features and outcome. In addition, they are not affected by common biases, e.g. that features with many possible splits or high minor allele frequencies are preferred. AVAILABILITY AND IMPLEMENTATION: The approaches are implemented in Version 0.3.3 of the R package RFSurrogates that is available at github.com/AGSeifert/RFSurrogates and the data are available at doi.org/10.25592/uhhfdm.12620.
format Online
Article
Text
id pubmed-10403431
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10403431 2023-08-06 Exploitation of surrogate variables in random forests for unbiased analysis of mutual impact and importance of features. Voges, Lucas F; Jarren, Lukas C; Seifert, Stephan. Bioinformatics. Original Paper. Oxford University Press 2023-07-31 /pmc/articles/PMC10403431/ /pubmed/37522865 http://dx.doi.org/10.1093/bioinformatics/btad471 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Exploitation of surrogate variables in random forests for unbiased analysis of mutual impact and importance of features
topic Original Paper