On the overestimation of random forest’s out-of-bag error

Bibliographic Details
Main Authors:	Janitza, Silke, Hornung, Roman
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2018
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6078316/ https://www.ncbi.nlm.nih.gov/pubmed/30080866 http://dx.doi.org/10.1371/journal.pone.0201904

collection	PubMed
description	The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error depending on the choices of random forests parameters. Based on simulated and real data this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling with sampling fractions that are proportional to the class sizes for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
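The stratified subsampling recommended in the description can be illustrated with a minimal sketch. It is not the authors' code: the function name `stratified_subsample`, the `equal_size` switch, and the default fraction of 0.632 are assumptions made for illustration only. For one tree, the sketch draws observations without replacement from each class, either in proportion to the class sizes or with the same count per class, and treats the remaining observations as out-of-bag.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction=0.632, rng=None, equal_size=False):
    """Return (in_bag, out_of_bag) index lists for a single tree.

    Hypothetical helper: `fraction` and `equal_size` are illustrative
    parameters, not names taken from the paper or from any library.
    """
    if rng is None:
        rng = random.Random()

    # Group observation indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Equal-size variant: every class contributes the same number of
    # observations, derived here from the smallest class.
    if equal_size:
        smallest = min(len(idx) for idx in by_class.values())
        n_equal = max(1, round(fraction * smallest))

    in_bag = []
    for idx in by_class.values():
        # Proportional variant: the same sampling fraction in every class,
        # so the subsample keeps the original class ratio.
        k = n_equal if equal_size else max(1, round(fraction * len(idx)))
        in_bag.extend(rng.sample(idx, k))  # sampling without replacement

    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in in_bag_set]
    return sorted(in_bag), out_of_bag


if __name__ == "__main__":
    labels = [0] * 80 + [1] * 20  # toy unbalanced binary outcome
    in_bag, oob = stratified_subsample(labels, rng=random.Random(1))
    print(len(in_bag), len(oob))
```

In a forest, each tree would receive its own subsample, and the out-of-bag error would be computed by predicting every observation only with the trees for which it was out-of-bag.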
format	Online Article Text
id	pubmed-6078316
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-6078316 2018-08-28 On the overestimation of random forest’s out-of-bag error Janitza, Silke Hornung, Roman PLoS One Research Article Public Library of Science 2018-08-06 /pmc/articles/PMC6078316/ /pubmed/30080866 http://dx.doi.org/10.1371/journal.pone.0201904 Text en © 2018 Janitza, Hornung http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic	Research Article