
Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction

BACKGROUND: In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. Th...


Bibliographic Details
Main authors:	Boulesteix, Anne-Laure; Strobl, Carolin
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813849/ https://www.ncbi.nlm.nih.gov/pubmed/20025773 http://dx.doi.org/10.1186/1471-2288-9-85

author	Boulesteix, Anne-Laure; Strobl, Carolin
collection	PubMed
description	BACKGROUND: In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias. METHODS: In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. RESULTS: We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly. CONCLUSIONS: The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy to present only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
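The selection effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' procedure: it assumes that, once class labels are permuted, every classifier variant misclassifies each sample independently with probability 0.5. Real classifier variants evaluated on the same data are correlated, so the size of the effect here is only indicative. The count of 124 variants comes from the abstract; the sample size of 60, the number of repetitions, and the function name `simulate_min_error` are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_min_error(n_samples, n_variants, n_repeats, seed=0):
    """Compare one classifier's error with the minimum over many variants.

    Assumption (not from the source): on uninformative data each variant
    misclassifies every sample independently with probability 0.5.
    """
    rng = random.Random(seed)
    single_errors = []  # error rate of one pre-specified variant
    min_errors = []     # error rate of the variant picked a posteriori
    for _ in range(n_repeats):
        errors = []
        for _ in range(n_variants):
            wrong = sum(rng.random() < 0.5 for _ in range(n_samples))
            errors.append(wrong / n_samples)
        single_errors.append(errors[0])
        min_errors.append(min(errors))
    return statistics.mean(single_errors), statistics.median(min_errors)


# 124 variants as in the abstract; 60 samples and 200 repeats are hypothetical.
mean_single, median_min = simulate_min_error(
    n_samples=60, n_variants=124, n_repeats=200
)
print(f"mean error of one fixed variant:   {mean_single:.3f}")
print(f"median of the minimal error rate:  {median_min:.3f}")
```

Under these assumptions a single pre-specified variant stays close to the chance level of 50%, while the minimum over all variants falls clearly below it, even though no variant has any real predictive ability.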
format	Text
id	pubmed-2813849
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2813849 2010-01-30 Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction Boulesteix, Anne-Laure; Strobl, Carolin BMC Med Res Methodol Research article BioMed Central 2009-12-21 /pmc/articles/PMC2813849/ /pubmed/20025773 http://dx.doi.org/10.1186/1471-2288-9-85 Text en Copyright ©2009 Boulesteix and Strobl; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813849/ https://www.ncbi.nlm.nih.gov/pubmed/20025773 http://dx.doi.org/10.1186/1471-2288-9-85
