Calibrating random forests for probability estimation

Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. In this work, we present two approaches for updating random forests for probability estimation. The first...

Bibliographic Details
Main authors: Dankowski, Theresa, Ziegler, Andreas
Format: Online Article Text
Language: English
Published: John Wiley and Sons Inc. 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5074325/
https://www.ncbi.nlm.nih.gov/pubmed/27074747
http://dx.doi.org/10.1002/sim.6959
collection PubMed
description Probabilities can be consistently estimated using random forests. It is, however, unclear how random forests should be updated to make predictions for other centers or at different time points. In this work, we present two approaches for updating random forests for probability estimation. The first method has been proposed by Elkan and may be used for updating any machine learning approach yielding consistent probabilities, so‐called probability machines. The second approach is a new strategy specifically developed for random forests. Using the terminal nodes, which represent conditional probabilities, the random forest is first translated to logistic regression models. These are, in turn, used for re‐calibration. The two updating strategies were compared in a simulation study and are illustrated with data from the German Stroke Study Collaboration. In most simulation scenarios, both methods led to similar improvements. In the simulation scenario in which the stricter assumptions of Elkan's method were not met, the logistic regression‐based re‐calibration approach for random forests outperformed Elkan's method. It also performed better on the stroke data than Elkan's method. The strength of Elkan's method is its general applicability to any probability machine. However, if the strict assumptions underlying this approach are not met, the logistic regression‐based approach is preferable for updating random forests for probability estimation. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
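The description above names Elkan's method but does not give its formula. As a minimal sketch, assuming the method meant here is the base-rate adjustment from Elkan's work on cost-sensitive learning, a predicted probability can be updated when only the prevalence of the outcome differs between the original and the new setting. The function name `elkan_adjust` and its parameter names are hypothetical and chosen for illustration; they do not come from the article.

```python
def elkan_adjust(p, base_rate_old, base_rate_new):
    """Update a probability estimate p for a changed base rate.

    Assumption (not stated in this record): the only difference between
    the original and the new setting is the prevalence of the outcome,
    from base_rate_old to base_rate_new.
    """
    b, b_new = base_rate_old, base_rate_new
    if not (0.0 < b < 1.0 and 0.0 < b_new < 1.0):
        raise ValueError("base rates must lie strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    # Base-rate adjustment: p' = b'(p - p*b) / (b - p*b + b'*p - b*b')
    numerator = b_new * (p - p * b)
    denominator = b - p * b + b_new * p - b * b_new
    return numerator / denominator


# Hypothetical example: a forest trained where prevalence was 40%
# is applied at a center where prevalence is 20%.
updated = elkan_adjust(0.6, base_rate_old=0.4, base_rate_new=0.2)
```

The adjustment is equivalent to rescaling the odds of `p` by the ratio of the new to the old prior odds, so a prediction equal to the old base rate maps to the new base rate. If the settings differ in more than prevalence, this assumption fails, which is the situation in which the description reports the logistic regression-based approach performing better.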
id pubmed-5074325
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-5074325 2016-11-04 Calibrating random forests for probability estimation Dankowski, Theresa Ziegler, Andreas Stat Med Research Articles John Wiley and Sons Inc. 2016-04-13 2016-09-30 /pmc/articles/PMC5074325/ /pubmed/27074747 http://dx.doi.org/10.1002/sim.6959 Text en © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs (http://creativecommons.org/licenses/by-nc-nd/4.0/) License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
topic Research Articles