Cargando…

Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation

BACKGROUND: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging – e...

Descripción completa

Detalles Bibliográficos
Autores principales: Baumann, Désirée, Baumann, Knut
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Springer International Publishing 2014
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4260165/
https://www.ncbi.nlm.nih.gov/pubmed/25506400
http://dx.doi.org/10.1186/s13321-014-0047-1
Descripción
Sumario:BACKGROUND: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging – especially under model uncertainty – and requires independent test objects. These test objects must not be involved in model building nor in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive possibility to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, there is a controversy in the literature with respect to the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop is systematically studied for regression models in combination with variable selection. METHODS: Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided. RESULTS: The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate. CONCLUSIONS: Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. As compared to a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred over a single test set. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-014-0047-1) contains supplementary material, which is available to authorized users.