
Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation

BACKGROUND: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging – e...


Bibliographic Details
Main Authors:	Baumann, Désirée; Baumann, Knut
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2014
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4260165/ https://www.ncbi.nlm.nih.gov/pubmed/25506400 http://dx.doi.org/10.1186/s13321-014-0047-1

_version_	1782348133646204928
author	Baumann, Désirée Baumann, Knut
collection	PubMed
description	BACKGROUND: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging, especially under model uncertainty, and requires independent test objects. These test objects must be involved neither in model building nor in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive possibility to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, there is a controversy in the literature with respect to the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection. METHODS: Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided. RESULTS: The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate. CONCLUSIONS: Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. Compared with a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred over a single test set. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-014-0047-1) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4260165
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-4260165 2014-12-11 Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation Baumann, Désirée Baumann, Knut J Cheminform Research Article Springer International Publishing 2014-11-26 /pmc/articles/PMC4260165/ /pubmed/25506400 http://dx.doi.org/10.1186/s13321-014-0047-1 Text en © Baumann and Baumann; licensee Chemistry Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Baumann, Désirée Baumann, Knut Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation
title	Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4260165/ https://www.ncbi.nlm.nih.gov/pubmed/25506400 http://dx.doi.org/10.1186/s13321-014-0047-1
work_keys_str_mv	AT baumanndesiree reliableestimationofpredictionerrorsforqsarmodelsundermodeluncertaintyusingdoublecrossvalidation AT baumannknut reliableestimationofpredictionerrorsforqsarmodelsundermodeluncertaintyusingdoublecrossvalidation
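
The double cross-validation scheme summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a deliberately simple form of variable selection (choosing one descriptor column for a one-variable least-squares fit), synthetic toy data, and hypothetical names (`make_toy_data`, `select_descriptor`, `double_cv`, `k_outer`, `k_inner`) that do not come from the article. The point it shows is that the outer-loop test objects take part neither in model building nor in model selection.

```python
import random
from statistics import mean


def make_toy_data(n=60, p=5, seed=1):
    """Synthetic data (an assumption): only descriptor 0 carries signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] + rng.gauss(0, 0.5) for row in X]
    return X, y


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single descriptor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (v - my) for x, v in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k disjoint folds."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_descriptor(X, y, train_idx, k_inner, rng):
    """Inner loop: return the column with the lowest cross-validated MSE."""
    folds = k_folds(train_idx, k_inner, rng)
    best_col, best_err = None, float("inf")
    for col in range(len(X[0])):
        fold_errors = []
        for fold in folds:
            held_out = set(fold)
            build = [i for i in train_idx if i not in held_out]
            a, b = fit_line([X[i][col] for i in build], [y[i] for i in build])
            fold_errors.append(
                mean([(y[i] - (a + b * X[i][col])) ** 2 for i in fold])
            )
        err = mean(fold_errors)
        if err < best_err:
            best_col, best_err = col, err
    return best_col


def double_cv(X, y, k_outer=5, k_inner=4, seed=0):
    """Outer loop: estimate the prediction error on objects never used
    for model selection or model building."""
    rng = random.Random(seed)
    squared_errors = []
    for test_idx in k_folds(range(len(y)), k_outer, rng):
        test = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in test]
        # Model selection sees the outer training objects only.
        col = select_descriptor(X, y, train_idx, k_inner, rng)
        # Refit the selected model on all outer training objects.
        a, b = fit_line([X[i][col] for i in train_idx],
                        [y[i] for i in train_idx])
        squared_errors += [(y[i] - (a + b * X[i][col])) ** 2 for i in test_idx]
    return mean(squared_errors)


if __name__ == "__main__":
    X, y = make_toy_data()
    print("double CV estimate of the prediction error (MSE):", double_cv(X, y))
```

In this sketch, `k_inner` controls how models are compared during selection, while `k_outer` sets the size of each outer test set. These loosely correspond to the inner-loop and outer-loop parameters whose influence the abstract describes.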
