Improved variance estimation of classification performance via reduction of bias caused by small sample size
BACKGROUND: Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than...
Main authors: | Wickenberg-Bolin, Ulrika; Göransson, Hanna; Fryknäs, Mårten; Gustafsson, Mats G; Isaksson, Anders |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435937/ https://www.ncbi.nlm.nih.gov/pubmed/16533392 http://dx.doi.org/10.1186/1471-2105-7-127 |
_version_ | 1782127299369369600 |
---|---|
author | Wickenberg-Bolin, Ulrika Göransson, Hanna Fryknäs, Mårten Gustafsson, Mats G Isaksson, Anders |
collection | PubMed |
description | BACKGROUND: Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore, different methods for small-sample performance estimation, such as a recently proposed procedure called Repeated Random Sampling (RSS), are also expected to result in heavily biased estimates, which in turn translate into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT). RESULTS: Our simulations reveal that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set. CONCLUSION: We show that via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, the uncertainty of the variance estimate is large in the simulations performed, indicating that the method in its present form cannot be directly applied to small data sets. |
format | Text |
id | pubmed-1435937 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1435937 2006-04-21 Improved variance estimation of classification performance via reduction of bias caused by small sample size Wickenberg-Bolin, Ulrika Göransson, Hanna Fryknäs, Mårten Gustafsson, Mats G Isaksson, Anders BMC Bioinformatics Research Article BioMed Central 2006-03-13 /pmc/articles/PMC1435937/ /pubmed/16533392 http://dx.doi.org/10.1186/1471-2105-7-127 Text en Copyright © 2006 Wickenberg-Bolin et al; licensee BioMed Central Ltd. |
title | Improved variance estimation of classification performance via reduction of bias caused by small sample size |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435937/ https://www.ncbi.nlm.nih.gov/pubmed/16533392 http://dx.doi.org/10.1186/1471-2105-7-127 |
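
The small-test-set bias described in the abstract can be illustrated with a short simulation. The sketch below is not the authors' RIDT procedure or their analytical expression. It assumes hypothetical true error rates drawn uniformly from [0.1, 0.3], one fresh test set per design, and the standard law of total variance, Var(observed) = Var(true) + E[e(1 - e)] / n_test. The function name `simulate_bias` and all parameters are illustrative assumptions.

```python
import random
import statistics


def simulate_bias(n_test, n_designs=2000, seed=0):
    """Return (observed variance, true variance, predicted bias).

    Assumption: each design set yields a classifier whose true error
    rate e is drawn uniformly from [0.1, 0.3] (purely hypothetical).
    """
    rng = random.Random(seed)

    # True error rate of the classifier built from each design set.
    true_errors = [rng.uniform(0.1, 0.3) for _ in range(n_designs)]

    # Observed error rate of each classifier on its own fresh test set
    # of n_test samples (each sample is misclassified with probability e).
    observed = [
        sum(rng.random() < e for _ in range(n_test)) / n_test
        for e in true_errors
    ]

    var_true = statistics.pvariance(true_errors)
    var_obs = statistics.pvariance(observed)

    # Binomial test-set noise averaged over designs: E[e(1 - e)] / n_test.
    predicted_bias = statistics.fmean(e * (1 - e) for e in true_errors) / n_test
    return var_obs, var_true, predicted_bias


if __name__ == "__main__":
    for n in (10, 100):
        var_obs, var_true, bias = simulate_bias(n)
        print(n, var_obs, var_true, var_obs - var_true, bias)
```

Under these assumptions, the variance of the observed error rates exceeds the variance of the true error rates, and the excess shrinks in proportion to 1 / n_test as the test set grows.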