Random forest versus logistic regression: a large-scale benchmark experiment

BACKGROUND AND GOAL: The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. R...

Bibliographic Details
Main Authors:	Couronné, Raphael, Probst, Philipp, Boulesteix, Anne-Laure
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6050737/ https://www.ncbi.nlm.nih.gov/pubmed/30016950 http://dx.doi.org/10.1186/s12859-018-2264-5

_version_	1783340400020291584
author	Couronné, Raphael Probst, Philipp Boulesteix, Anne-Laure
collection	PubMed
description	BACKGROUND AND GOAL: The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. RESULTS: In this context, we present a large scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired from clinical trial methodology, thus avoiding common pitfalls and major sources of biases. CONCLUSION: RF performed better than LR according to the considered accuracy measured in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95%-CI =[0.022,0.038]) for the accuracy, 0.041 (95%-CI =[0.031,0.053]) for the Area Under the Curve, and − 0.027 (95%-CI =[−0.034,−0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side-result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, thus emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2264-5) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6050737
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6050737 2018-07-19 Random forest versus logistic regression: a large-scale benchmark experiment Couronné, Raphael Probst, Philipp Boulesteix, Anne-Laure BMC Bioinformatics Research Article BioMed Central 2018-07-17 /pmc/articles/PMC6050737/ /pubmed/30016950 http://dx.doi.org/10.1186/s12859-018-2264-5 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Random forest versus logistic regression: a large-scale benchmark experiment
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6050737/ https://www.ncbi.nlm.nih.gov/pubmed/30016950 http://dx.doi.org/10.1186/s12859-018-2264-5
