Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via t...

Bibliographic Details
Main authors:	Hanczar, Blaise; Hua, Jianping; Dougherty, Edward R
Format:	Online Article Text
Language:	English
Published:	Springer 2007
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3171336/ https://www.ncbi.nlm.nih.gov/pubmed/18288255 http://dx.doi.org/10.1155/2007/38473

_version_	1782211739573551104
author	Hanczar, Blaise; Hua, Jianping; Dougherty, Edward R
author_facet	Hanczar, Blaise; Hua, Jianping; Dougherty, Edward R
author_sort	Hanczar, Blaise
collection	PubMed
description	The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.
format	Online Article Text
id	pubmed-3171336
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	Springer
record_format	MEDLINE/PubMed
spelling	pubmed-3171336 2011-09-13 Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings Hanczar, Blaise; Hua, Jianping; Dougherty, Edward R EURASIP J Bioinform Syst Biol Research Article Springer 2007-10-30 /pmc/articles/PMC3171336/ /pubmed/18288255 http://dx.doi.org/10.1155/2007/38473 Text en Copyright © 2007 Blaise Hanczar et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Hanczar, Blaise Hua, Jianping Dougherty, Edward R Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings
title	Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings
title_sort	decorrelation of the true and estimated classifier errors in high-dimensional settings
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3171336/ https://www.ncbi.nlm.nih.gov/pubmed/18288255 http://dx.doi.org/10.1155/2007/38473
work_keys_str_mv	AT hanczarblaise decorrelationofthetrueandestimatedclassifiererrorsinhighdimensionalsettings AT huajianping decorrelationofthetrueandestimatedclassifiererrorsinhighdimensionalsettings AT doughertyedwardr decorrelationofthetrueandestimatedclassifiererrorsinhighdimensionalsettings
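
The abstract's "decomposition of the variance of the deviation distribution" can be illustrated with the standard identity for the variance of a difference: Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true), where rho is the correlation between the estimated and true errors. The sketch below is not the authors' code or their exact formulation. It is a minimal toy simulation, assuming jointly Gaussian error values with made-up means and standard deviations, and the function names `simulate_errors` and `decompose` are hypothetical. It only shows that, with the two variances held fixed, a lower correlation inflates the variance of the deviation.

```python
import numpy as np


def simulate_errors(rho, n=5000, mean=0.25, sd=0.05, seed=0):
    """Draw toy (true, estimated) error pairs with target correlation rho.

    Assumption: Gaussian values with equal mean and spread. These are
    illustrative numbers, not error rates from any real classifier.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    true_err = mean + sd * z1
    # Mix z1 and z2 so that corr(true_err, est_err) is approximately rho.
    est_err = mean + sd * (rho * z1 + np.sqrt(1.0 - rho**2) * z2)
    return true_err, est_err


def decompose(true_err, est_err):
    """Return the terms of Var(est - true) = Var(est) + Var(true) - 2*rho*sd*sd."""
    true_err = np.asarray(true_err, dtype=float)
    est_err = np.asarray(est_err, dtype=float)
    var_true = true_err.var()   # population variance (ddof=0)
    var_est = est_err.var()
    rho = np.corrcoef(true_err, est_err)[0, 1]
    var_deviation = (est_err - true_err).var()
    rebuilt = var_est + var_true - 2.0 * rho * np.sqrt(var_est * var_true)
    return {
        "var_true": var_true,
        "var_est": var_est,
        "rho": rho,
        "var_deviation": var_deviation,
        "rebuilt": rebuilt,
    }


if __name__ == "__main__":
    for target in (0.9, 0.5, 0.1):
        terms = decompose(*simulate_errors(target))
        print(f"target rho={target:.1f}  sample rho={terms['rho']:.3f}  "
              f"Var(deviation)={terms['var_deviation']:.6f}  "
              f"rebuilt={terms['rebuilt']:.6f}")
```

Because `decompose` uses the same normalization (ddof=0) for every term, the rebuilt value matches the directly computed deviation variance up to floating-point error. Real true and estimated errors would come from repeated classifier design on resampled data rather than from a Gaussian draw.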