
Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation

Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of s...


Bibliographic Details
Main authors: Xiao, Yufei; Hua, Jianping; Dougherty, Edward R
Format: Online Article Text
Language: English
Published: Springer 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3171328/
https://www.ncbi.nlm.nih.gov/pubmed/17713587
http://dx.doi.org/10.1155/2007/16354
author Xiao, Yufei
Hua, Jianping
Dougherty, Edward R
collection PubMed
description Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, and this is most directly analyzed via the deviation distribution of the estimator, this being the distribution of the difference between the estimated and true errors. Past studies indicate that given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution in addition to the variation in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, contributing to over half of the variance in many of the cases studied. We consider linear-discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the t-test for feature selection; and k-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection, compared with the case when cross-validation is performed on a given feature set. This is reflected by the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection towards the deviation variance.
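The two quantities named in the description, the deviation distribution and the CRIDD, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, and the normalization by the variance with feature selection is an assumption, suggested only by the abstract's phrase "over half of the variance"; the article itself gives the exact definition.

```python
import numpy as np


def deviation_variance(estimated_errors, true_errors):
    """Sample variance of the deviation distribution.

    Each deviation is (estimated error - true error) for one sample set.
    """
    deviations = (np.asarray(estimated_errors, dtype=float)
                  - np.asarray(true_errors, dtype=float))
    return float(np.var(deviations, ddof=1))


def cridd(var_with_fs, var_without_fs):
    """Relative increase in deviation variance due to feature selection.

    ASSUMPTION: normalized by the variance obtained with feature
    selection; check the article for the authors' exact definition.
    """
    if var_with_fs <= 0:
        raise ValueError("variance with feature selection must be positive")
    return (var_with_fs - var_without_fs) / var_with_fs


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the study.
    v_fs = deviation_variance([0.30, 0.10, 0.25, 0.05], [0.20, 0.20, 0.20, 0.20])
    v_no_fs = deviation_variance([0.22, 0.18, 0.21, 0.19], [0.20, 0.20, 0.20, 0.20])
    print(cridd(v_fs, v_no_fs))
```

Under this assumed normalization, a positive value means the deviation variance is larger with feature selection than without it, which matches the direction of the finding reported in the description.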
format Online
Article
Text
id pubmed-3171328
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher Springer
record_format MEDLINE/PubMed
spelling pubmed-3171328 2011-09-13 Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation Xiao, Yufei Hua, Jianping Dougherty, Edward R EURASIP J Bioinform Syst Biol Research Article
Springer 2007-02-19 /pmc/articles/PMC3171328/ /pubmed/17713587 http://dx.doi.org/10.1155/2007/16354 Text en Copyright © 2007 Yufei Xiao et al. https://creativecommons.org/licenses/by/4.0/
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Xiao, Yufei
Hua, Jianping
Dougherty, Edward R
Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation
title Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3171328/
https://www.ncbi.nlm.nih.gov/pubmed/17713587
http://dx.doi.org/10.1155/2007/16354