
Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints

BACKGROUND: Modern modelling techniques may potentially provide more accurate predictions of binary outcomes than classical techniques. We aimed to study the predictive performance of different modelling techniques in relation to the effective sample size (“data hungriness”). METHODS: We performed s...


Bibliographic Details
Main authors:	van der Ploeg, Tjeerd; Austin, Peter C; Steyerberg, Ewout W
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4289553/ https://www.ncbi.nlm.nih.gov/pubmed/25532820 http://dx.doi.org/10.1186/1471-2288-14-137

_version_	1782352127449890816
author	van der Ploeg, Tjeerd; Austin, Peter C; Steyerberg, Ewout W
author_facet	van der Ploeg, Tjeerd; Austin, Peter C; Steyerberg, Ewout W
author_sort	van der Ploeg, Tjeerd
collection	PubMed
description	BACKGROUND: Modern modelling techniques may potentially provide more accurate predictions of binary outcomes than classical techniques. We aimed to study the predictive performance of different modelling techniques in relation to the effective sample size (“data hungriness”). METHODS: We performed simulation studies based on three clinical cohorts: 1282 patients with head and neck cancer (with 46.9% 5 year survival), 1731 patients with traumatic brain injury (22.3% 6 month mortality) and 3181 patients with minor head injury (7.6% with CT scan abnormalities). We compared three relatively modern modelling techniques: support vector machines (SVM), neural nets (NN), and random forests (RF) and two classical techniques: logistic regression (LR) and classification and regression trees (CART). We created three large artificial databases with 20 fold, 10 fold and 6 fold replication of subjects, where we generated dichotomous outcomes according to different underlying models. We applied each modelling technique to increasingly larger development parts (100 repetitions). The area under the ROC-curve (AUC) indicated the performance of each model in the development part and in an independent validation part. Data hungriness was defined by plateauing of AUC and small optimism (difference between the mean apparent AUC and the mean validated AUC <0.01). RESULTS: We found that a stable AUC was reached by LR at approximately 20 to 50 events per variable, followed by CART, SVM, NN and RF models. Optimism decreased with increasing sample sizes and the same ranking of techniques. The RF, SVM and NN models showed instability and a high optimism even with >200 events per variable. CONCLUSIONS: Modern modelling techniques such as SVM, NN and RF may need over 10 times as many events per variable to achieve a stable AUC and a small optimism than classical modelling techniques such as LR. This implies that such modern techniques should only be used in medical prediction problems if very large data sets are available. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2288-14-137) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4289553
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4289553 2015-01-12 Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints van der Ploeg, Tjeerd; Austin, Peter C; Steyerberg, Ewout W BMC Med Res Methodol Research Article BioMed Central 2014-12-22 /pmc/articles/PMC4289553/ /pubmed/25532820 http://dx.doi.org/10.1186/1471-2288-14-137 Text en © van der Ploeg et al.; licensee BioMed Central. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints
topic	Research Article
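
The data-hungriness criterion in the abstract (events per variable, and optimism as the mean apparent AUC minus the mean validated AUC, required to be below 0.01) can be sketched in a few lines of Python. This is only an illustration and not the authors' code: the function names, the rank-based AUC calculation, and the choice to count outcome value 1 as the event are assumptions made here.

```python
from statistics import mean


def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    event (label 1) gets a higher score than a random non-event (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def events_per_variable(labels, n_predictors):
    """Number of events divided by the number of candidate predictors.
    Assumption: an 'event' is an outcome coded 1."""
    return sum(labels) / n_predictors


def optimism(apparent_aucs, validated_aucs):
    """Mean apparent AUC (development part) minus mean validated AUC
    (independent validation part), taken over the repetitions."""
    return mean(apparent_aucs) - mean(validated_aucs)


def has_small_optimism(apparent_aucs, validated_aucs, threshold=0.01):
    """True when optimism is below the 0.01 threshold quoted in the abstract."""
    return optimism(apparent_aucs, validated_aucs) < threshold
```

For example, `has_small_optimism([0.800, 0.802], [0.795, 0.799])` compares a mean apparent AUC of 0.801 with a mean validated AUC of 0.797; the difference of about 0.004 falls under the threshold. The plateauing of AUC, the other half of the definition, is not covered by this sketch.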
