
Improved variance estimation of classification performance via reduction of bias caused by small sample size

BACKGROUND: Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than...


Bibliographic Details
Main Authors:	Wickenberg-Bolin, Ulrika; Göransson, Hanna; Fryknäs, Mårten; Gustafsson, Mats G; Isaksson, Anders
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435937/ https://www.ncbi.nlm.nih.gov/pubmed/16533392 http://dx.doi.org/10.1186/1471-2105-7-127

author	Wickenberg-Bolin, Ulrika; Göransson, Hanna; Fryknäs, Mårten; Gustafsson, Mats G; Isaksson, Anders
author_sort	Wickenberg-Bolin, Ulrika
collection	PubMed
description	BACKGROUND: Supervised learning for classification of cancer employs a set of design examples to learn how to discriminate between tumors. In practice it is crucial to confirm that the classifier is robust with good generalization performance to new examples, or at least that it performs better than random guessing. A suggested alternative is to obtain a confidence interval of the error rate using repeated design and test sets selected from available examples. However, it is known that even in the ideal situation of repeated designs and tests with completely novel samples in each cycle, a small test set size leads to a large bias in the estimate of the true variance between design sets. Therefore, different methods for small sample performance estimation, such as a recently proposed procedure called Repeated Random Sampling (RSS), are also expected to result in heavily biased estimates, which in turn translate into biased confidence intervals. Here we explore such biases and develop a refined algorithm called Repeated Independent Design and Test (RIDT). RESULTS: Our simulations reveal that repeated designs and tests based on resampling in a fixed bag of samples yield a biased variance estimate. We also demonstrate that it is possible to obtain an improved variance estimate by means of a procedure that explicitly models how this bias depends on the number of samples used for testing. For the special case of repeated designs and tests using new samples for each design and test, we present an exact analytical expression for how the expected value of the bias decreases with the size of the test set. CONCLUSION: We show that via modeling and subsequent reduction of the small sample bias, it is possible to obtain an improved estimate of the variance of classifier performance between design sets. However, the uncertainty of the variance estimate is large in the simulations performed, indicating that the method in its present form cannot be directly applied to small data sets.
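The small-test-set bias described in the abstract can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the RIDT algorithm or the paper's analytical expression: it assumes independent test samples and uses the standard law of total variance, Var(observed error) = Var(true error) + E[e(1 - e)] / N_test. The function names, the Gaussian spread of true error rates, and all numeric settings are hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_observed_variance(true_errors, n_test, rng):
    """Variance of test-set error estimates, one estimate per design set.

    Each classifier with true error rate e is scored on n_test fresh,
    independent test samples (an idealised assumption).
    """
    estimates = []
    for e in true_errors:
        mistakes = sum(rng.random() < e for _ in range(n_test))
        estimates.append(mistakes / n_test)
    return statistics.pvariance(estimates)


def expected_bias(true_errors, n_test):
    """Binomial sampling term E[e(1 - e)] / n_test from the law of total variance."""
    return statistics.fmean([e * (1 - e) for e in true_errors]) / n_test


rng = random.Random(0)
# Hypothetical true error rates for 2000 design sets (illustrative values only).
true_errors = [min(max(rng.gauss(0.2, 0.03), 0.0), 1.0) for _ in range(2000)]
true_var = statistics.pvariance(true_errors)

for n_test in (10, 50, 250):
    observed = simulate_observed_variance(true_errors, n_test, rng)
    # In practice the true error rates are unknown, so this term must be modelled.
    corrected = observed - expected_bias(true_errors, n_test)
    print(n_test, round(true_var, 5), round(observed, 5), round(corrected, 5))
```

In this idealised setting the observed variance exceeds the true between-design-set variance by a term that shrinks as 1 / N_test. Here the correction uses the known true error rates, which are unavailable with real data.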
format	Text
id	pubmed-1435937
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1435937 2006-04-21 Improved variance estimation of classification performance via reduction of bias caused by small sample size. Wickenberg-Bolin, Ulrika; Göransson, Hanna; Fryknäs, Mårten; Gustafsson, Mats G; Isaksson, Anders. BMC Bioinformatics. Research Article. BioMed Central 2006-03-13 /pmc/articles/PMC1435937/ /pubmed/16533392 http://dx.doi.org/10.1186/1471-2105-7-127 Text en Copyright © 2006 Wickenberg-Bolin et al; licensee BioMed Central Ltd.
title	Improved variance estimation of classification performance via reduction of bias caused by small sample size
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1435937/ https://www.ncbi.nlm.nih.gov/pubmed/16533392 http://dx.doi.org/10.1186/1471-2105-7-127


Similar Items