Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via t...

Bibliographic Details
Main authors: Hanczar, Blaise; Hua, Jianping; Dougherty, Edward R
Format: Online Article Text
Language: English
Published: Springer 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3171336/
https://www.ncbi.nlm.nih.gov/pubmed/18288255
http://dx.doi.org/10.1155/2007/38473
_version_ 1782211739573551104
author Hanczar, Blaise
Hua, Jianping
Dougherty, Edward R
collection PubMed
description The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.
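The role of correlation described in the abstract can be illustrated with the standard identity for the variance of a difference, Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true). The following is a minimal sketch and not the authors' code: the notation may differ from the decomposition used in the article, the function names `simulate` and `deviation_variance_terms` are hypothetical, and the Gaussian error model with mean 0.20 and standard deviations 0.03 and 0.06 is an assumption chosen purely for illustration.

```python
import math
import random


def variance(xs):
    """Population variance of a sequence of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(variance(xs) * variance(ys))


def deviation_variance_terms(true_err, est_err):
    """Return Var(est - true) computed directly and via the decomposition."""
    deviation = [e - t for e, t in zip(est_err, true_err)]
    direct = variance(deviation)
    rho = correlation(true_err, est_err)
    s_true = math.sqrt(variance(true_err))
    s_est = math.sqrt(variance(est_err))
    decomposed = s_est ** 2 + s_true ** 2 - 2 * rho * s_est * s_true
    return direct, decomposed


def simulate(rho, n=5000, seed=0):
    """Draw hypothetical (true, estimated) error pairs with target correlation rho.

    Assumed toy model: both errors are Gaussian around 0.20 and are not
    clipped to [0, 1]; real error distributions will look different.
    """
    rng = random.Random(seed)
    true_err, est_err = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        true_err.append(0.20 + 0.03 * z1)
        # Mix z1 and z2 so that est_err has correlation rho with true_err.
        est_err.append(0.20 + 0.06 * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return true_err, est_err


if __name__ == "__main__":
    for rho in (0.8, 0.2):
        direct, decomposed = deviation_variance_terms(*simulate(rho))
        print(f"rho={rho}: direct={direct:.6f}, decomposed={decomposed:.6f}")
```

In this toy setting the variances of the two errors are held fixed, so any increase in the variance of the deviation when rho drops from 0.8 to 0.2 comes from the correlation term alone. This is the kind of effect the abstract attributes to high dimensionality.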
format Online
Article
Text
id pubmed-3171336
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher Springer
record_format MEDLINE/PubMed
spelling pubmed-3171336 2011-09-13 Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings Hanczar, Blaise Hua, Jianping Dougherty, Edward R EURASIP J Bioinform Syst Biol Research Article Springer 2007-10-30 /pmc/articles/PMC3171336/ /pubmed/18288255 http://dx.doi.org/10.1155/2007/38473 Text en Copyright © 2007 Blaise Hanczar et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Hanczar, Blaise
Hua, Jianping
Dougherty, Edward R
Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings
title Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3171336/
https://www.ncbi.nlm.nih.gov/pubmed/18288255
http://dx.doi.org/10.1155/2007/38473