
A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization

BACKGROUND: In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset—in its entirety—before training/test set based prediction error estimat...


Bibliographic Details
Main Authors:	Hornung, Roman; Bernau, Christoph; Truntzer, Caroline; Wilson, Rory; Stadler, Thomas; Boulesteix, Anne-Laure
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4634762/ https://www.ncbi.nlm.nih.gov/pubmed/26537575 http://dx.doi.org/10.1186/s12874-015-0088-9

_version_	1782399414513434624
author	Hornung, Roman Bernau, Christoph Truntzer, Caroline Wilson, Rory Stadler, Thomas Boulesteix, Anne-Laure
author_facet	Hornung, Roman Bernau, Christoph Truntzer, Caroline Wilson, Rory Stadler, Thomas Boulesteix, Anne-Laure
author_sort	Hornung, Roman
collection	PubMed
description	BACKGROUND: In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset, in its entirety, before training/test set based prediction error estimation by cross-validation (CV), an approach referred to as “incomplete CV”. Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values. METHODS: We devise the easily interpretable and general measure CVIIM (“CV Incompleteness Impact Measure”) to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA. RESULTS: Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings. CONCLUSIONS: While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12874-015-0088-9) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4634762
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4634762 2015-11-06 A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization Hornung, Roman Bernau, Christoph Truntzer, Caroline Wilson, Rory Stadler, Thomas Boulesteix, Anne-Laure BMC Med Res Methodol Research Article BioMed Central 2015-11-04 /pmc/articles/PMC4634762/ /pubmed/26537575 http://dx.doi.org/10.1186/s12874-015-0088-9 Text en © Hornung et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Hornung, Roman Bernau, Christoph Truntzer, Caroline Wilson, Rory Stadler, Thomas Boulesteix, Anne-Laure A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization
title	A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization
title_sort	measure of the impact of cv incompleteness on prediction error estimation with application to pca and normalization
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4634762/ https://www.ncbi.nlm.nih.gov/pubmed/26537575 http://dx.doi.org/10.1186/s12874-015-0088-9
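
The distinction the abstract draws between "incomplete CV" (a data preparation step run once on the entire dataset before CV) and full CV (the step repeated in each CV iteration on the training part only) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: per-column standardization stands in for the preparation step (it is neither the microarray normalization nor the PCA studied in the article), the classifier is a simple nearest-centroid rule, the data are simulated, and all function names are hypothetical. The record does not reproduce the CVIIM formula, so `relative_underestimation` is only a hypothetical ratio-style summary of how much lower the incomplete estimate is, not the authors' definition.

```python
import random
from statistics import mean, pstdev


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the given rows only."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def apply_scaler(params, rows):
    """Standardize rows with previously learned column parameters."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]


def fit_centroids(rows, labels):
    """Nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for lab in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == lab]
        centroids[lab] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist(lab):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[lab]))
    return min(centroids, key=dist)


def cv_error(X, y, k, complete, seed=0):
    """K-fold CV misclassification rate.

    complete=False: the preparation step is fitted once on all rows (incomplete CV).
    complete=True:  the step is refitted on the training rows of every fold.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    if not complete:
        # Incomplete CV: the scaler sees every row, including later test rows.
        X = apply_scaler(fit_scaler(X), X)
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_train = [X[i] for i in train]
        X_test = [X[i] for i in test]
        if complete:
            # Full CV: parameters come from the training rows only.
            params = fit_scaler(X_train)
            X_train = apply_scaler(params, X_train)
            X_test = apply_scaler(params, X_test)
        model = fit_centroids(X_train, [y[i] for i in train])
        errors += sum(predict(model, x) != y[i] for x, i in zip(X_test, test))
    return errors / len(X)


def relative_underestimation(e_incomplete, e_full):
    """Hypothetical summary: share by which the incomplete estimate is lower."""
    if e_full > 0 and e_incomplete < e_full:
        return 1 - e_incomplete / e_full
    return 0.0


# Simulated two-class data (an assumption, not data from the article).
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(1.0 * label, 1.0) for _ in range(5)] for label in y]

e_full = cv_error(X, y, k=5, complete=True)
e_incomplete = cv_error(X, y, k=5, complete=False)
print(e_full, e_incomplete, relative_underestimation(e_incomplete, e_full))
```

Both calls use the same fold assignment, so any difference between the two estimates comes only from where the preparation step is fitted. With an unsupervised step this simple the two estimates may well coincide on a given dataset; the article reports that the size of the gap depends on the preparation step under consideration.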
