Intervention in prediction measure: a new approach to assessing variable importance for random forests

BACKGROUND: Random forests are a popular method in many fields since they can be successfully applied to complex data, with a small sample size, complex interactions and correlations, mixed type predictors, etc. Furthermore, they provide variable importance measures that aid qualitative interpretati...

Bibliographic Details
Main Author: Epifanio, Irene
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5414143/
https://www.ncbi.nlm.nih.gov/pubmed/28464827
http://dx.doi.org/10.1186/s12859-017-1650-8
author Epifanio, Irene
collection PubMed
description BACKGROUND: Random forests are a popular method in many fields since they can be successfully applied to complex data, with a small sample size, complex interactions and correlations, mixed type predictors, etc. Furthermore, they provide variable importance measures that aid qualitative interpretation and also the selection of relevant predictors. However, most of these measures rely on the choice of a performance measure. But measures of prediction performance are not unique or there is not even a clear definition, as in the case of multivariate response random forests.
METHODS: A new alternative importance measure, called Intervention in Prediction Measure, is investigated. It depends on the structure of the trees, without depending on performance measures. It is compared with other well-known variable importance measures in different contexts, such as a classification problem with variables of different types, another classification problem with correlated predictor variables, and problems with multivariate responses and predictors of different types.
RESULTS: Several simulation studies are carried out, showing the new measure to be very competitive. In addition, it is applied in two well-known bioinformatics applications previously used in other papers. Improvements in performance are also provided for these applications by the use of this new measure.
CONCLUSIONS: This new measure is expressed as a percentage, which makes it attractive in terms of interpretability. It can be used with new observations. It can be defined globally, for each class (in a classification problem) and case-wise. It can easily be computed for any kind of response, including multivariate responses. Furthermore, it can be used with any algorithm employed to grow each individual tree. It can be used in place of (or in addition to) other variable importance measures.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1650-8) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-5414143
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5414143 2017-05-03
Intervention in prediction measure: a new approach to assessing variable importance for random forests
Epifanio, Irene
BMC Bioinformatics
Methodology Article
BioMed Central 2017-05-02
/pmc/articles/PMC5414143/ /pubmed/28464827 http://dx.doi.org/10.1186/s12859-017-1650-8
Text en
© The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Intervention in prediction measure: a new approach to assessing variable importance for random forests
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5414143/
https://www.ncbi.nlm.nih.gov/pubmed/28464827
http://dx.doi.org/10.1186/s12859-017-1650-8