Improved variance estimation of classification performance via reduction of bias caused by small sample size

BACKGROUND: Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than...

Bibliographic Details
Main authors: Wickenberg-Bolin, Ulrika; Göransson, Hanna; Fryknäs, Mårten; Gustafsson, Mats G; Isaksson, Anders
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435937/
https://www.ncbi.nlm.nih.gov/pubmed/16533392
http://dx.doi.org/10.1186/1471-2105-7-127
author Wickenberg-Bolin, Ulrika
Göransson, Hanna
Fryknäs, Mårten
Gustafsson, Mats G
Isaksson, Anders
collection PubMed
description BACKGROUND: Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice, it is crucial to confirm that the classifier is robust, with good generalization performance on new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from the available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore, methods for small sample performance estimation, such as a recently proposed procedure called Repeated Random Sampling (RSS), are also expected to result in heavily biased estimates, which in turn translate into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT). RESULTS: Our simulations reveal that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set. CONCLUSION: We show that, via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, the uncertainty of the variance estimate is large in the simulations performed, indicating that the method in its present form cannot be directly applied to small data sets.
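The dependence of the bias on test set size can be illustrated with a minimal simulation of the idealized case in which every design is tested on completely new samples. This is a sketch under stated assumptions, not the authors' RIDT algorithm and not necessarily the analytical expression given in the paper: the true error rates are drawn from a hypothetical Beta(8, 32) distribution, the function name simulate_observed_variance is made up for illustration, and the predicted value follows from the law of total variance for a binomially distributed error count, Var(e_hat) = Var(e) + E[e(1 - e)] / n_test.

```python
import random
import statistics


def simulate_observed_variance(n_test, n_designs=20000, seed=0):
    """Return (observed variance, true variance, predicted observed variance).

    Hypothetical setup: each design set yields a classifier whose true error
    rate e is drawn from Beta(8, 32); each classifier is then tested on
    n_test fresh samples, so its error count is Binomial(n_test, e).
    """
    rng = random.Random(seed)
    # True (unobservable) error rate of the classifier from each design set.
    true_err = [rng.betavariate(8, 32) for _ in range(n_designs)]
    # Observed error rate: fraction of misclassified samples in the test set.
    observed = [
        sum(rng.random() < e for _ in range(n_test)) / n_test
        for e in true_err
    ]
    var_true = statistics.pvariance(true_err)
    var_obs = statistics.pvariance(observed)
    # Law of total variance: the test-set sampling noise adds E[e(1-e)]/n_test.
    noise = statistics.fmean([e * (1 - e) for e in true_err]) / n_test
    return var_obs, var_true, var_true + noise


if __name__ == "__main__":
    for n_test in (5, 10, 50, 200):
        var_obs, var_true, predicted = simulate_observed_variance(n_test)
        print(f"n_test={n_test:4d}  observed={var_obs:.5f}  "
              f"true={var_true:.5f}  predicted={predicted:.5f}")
```

In this toy setting the variance of the observed error rates overestimates the variance between design sets, and the excess shrinks roughly as 1/n_test. The resampling case described in the abstract, where designs and tests are drawn from one fixed bag of samples, is not covered by this sketch.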
format Text
id pubmed-1435937
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1435937 2006-04-21 BMC Bioinformatics Research Article BioMed Central 2006-03-13 /pmc/articles/PMC1435937/ /pubmed/16533392 http://dx.doi.org/10.1186/1471-2105-7-127 Text en Copyright © 2006 Wickenberg-Bolin et al; licensee BioMed Central Ltd.
title Improved variance estimation of classification performance via reduction of bias caused by small sample size
topic Research Article