Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting

Machine learning approaches are increasingly suggested as tools to improve prediction of clinical outcomes. We aimed to identify when machine learning methods perform better than a classical learning method. To this end, we examined the impact of the data-generating process on the relative predictive accu...
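
The full description in this record names the c-statistic and the Brier score among the metrics used to compare methods in the validation samples. As a rough illustration only, the sketch below shows how these two metrics could be computed from binary outcomes and predicted probabilities. The function names `brier_score` and `c_statistic` and the toy data are assumptions made for this example; they are not taken from the article, and the authors' own implementation may differ.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def c_statistic(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Share of (event, non-event) pairs in which the event has the higher
    predicted probability; tied pairs count as one half."""
    if len(y_true) != len(y_prob):
        raise ValueError("inputs must be of equal length")
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical toy data: four subjects, two of whom had the outcome.
outcomes = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(brier_score(outcomes, predicted))
print(c_statistic(outcomes, predicted))
```

The pairwise loop is quadratic in the sample size, which is fine for a small illustration; for samples of several thousand patients, as in the study, a rank-based calculation would usually be preferred.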

Bibliographic Details
Main Authors:	Austin, Peter C; Harrell, Frank E; Steyerberg, Ewout W
Format:	Online Article Text
Language:	English
Published:	SAGE Publications 2021
Subjects:	Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8188999/ https://www.ncbi.nlm.nih.gov/pubmed/33848231 http://dx.doi.org/10.1177/09622802211002867

_version_	1783705433774489600
author	Austin, Peter C Harrell, Frank E Steyerberg, Ewout W
author_sort	Austin, Peter C
collection	PubMed
description	Machine learning approaches are increasingly suggested as tools to improve prediction of clinical outcomes. We aimed to identify when machine learning methods perform better than a classical learning method. To this end, we examined the impact of the data-generating process on the relative predictive accuracy of six machine and statistical learning methods: bagged classification trees, stochastic gradient boosting machines using trees as the base learners, random forests, the lasso, ridge regression, and unpenalized logistic regression. We performed simulations in two large cardiovascular datasets, each of which comprised an independent derivation and validation sample collected from temporally distinct periods: patients hospitalized with acute myocardial infarction (AMI, n = 9484 vs. n = 7000) and patients hospitalized with congestive heart failure (CHF, n = 8240 vs. n = 7608). We used six data-generating processes based on each of the six learning methods to simulate outcomes in the derivation and validation samples based on 33 and 28 predictors in the AMI and CHF data sets, respectively. We applied six prediction methods in each of the simulated derivation samples and evaluated performance in the simulated validation samples according to c-statistic, generalized R², Brier score, and calibration. While no method had uniformly superior performance across all six data-generating processes and eight performance metrics, (un)penalized logistic regression and boosted trees tended to have superior performance to the other methods across a range of data-generating processes and performance metrics. This study confirms that classical statistical learning methods perform well in low-dimensional settings with large data sets.
format	Online Article Text
id	pubmed-8188999
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	SAGE Publications
record_format	MEDLINE/PubMed
spelling	pubmed-8188999 2021-06-21 Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting Austin, Peter C Harrell, Frank E Steyerberg, Ewout W Stat Methods Med Res Articles SAGE Publications 2021-04-13 2021-06 /pmc/articles/PMC8188999/ /pubmed/33848231 http://dx.doi.org/10.1177/09622802211002867 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by-nc/4.0/ This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).
title	Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting
topic	Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8188999/ https://www.ncbi.nlm.nih.gov/pubmed/33848231 http://dx.doi.org/10.1177/09622802211002867

Similar Items