Cargando…

Stratification bias in low signal microarray studies

BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is...

Descripción completa

Detalles Bibliográficos
Autores principales:	Parker, Brian J, Günter, Simon, Bedo, Justin
Formato:	Texto
Lenguaje:	English
Publicado:	BioMed Central 2007
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2211509/ https://www.ncbi.nlm.nih.gov/pubmed/17764577 http://dx.doi.org/10.1186/1471-2105-8-326

_version_	1782148531987939328
author	Parker, Brian J Günter, Simon Bedo, Justin
author_facet	Parker, Brian J Günter, Simon Bedo, Justin
author_sort	Parker, Brian J
collection	PubMed
description	BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and a modified version of k-fold cross-validation, balanced, stratified cross-validation and balanced leave-one-out cross-validation, avoids the bias. Therefore for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
format	Text
id	pubmed-2211509
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-22115092008-01-23 Stratification bias in low signal microarray studies Parker, Brian J Günter, Simon Bedo, Justin BMC Bioinformatics Research Article BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and a modified version of k-fold cross-validation, balanced, stratified cross-validation and balanced leave-one-out cross-validation, avoids the bias. Therefore for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets. BioMed Central 2007-09-02 /pmc/articles/PMC2211509/ /pubmed/17764577 http://dx.doi.org/10.1186/1471-2105-8-326 Text en Copyright © 2007 Parker et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Parker, Brian J Günter, Simon Bedo, Justin Stratification bias in low signal microarray studies
title	Stratification bias in low signal microarray studies
title_full	Stratification bias in low signal microarray studies
title_fullStr	Stratification bias in low signal microarray studies
title_full_unstemmed	Stratification bias in low signal microarray studies
title_short	Stratification bias in low signal microarray studies
title_sort	stratification bias in low signal microarray studies
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2211509/ https://www.ncbi.nlm.nih.gov/pubmed/17764577 http://dx.doi.org/10.1186/1471-2105-8-326
work_keys_str_mv	AT parkerbrianj stratificationbiasinlowsignalmicroarraystudies AT guntersimon stratificationbiasinlowsignalmicroarraystudies AT bedojustin stratificationbiasinlowsignalmicroarraystudies

Stratification bias in low signal microarray studies

Ejemplares similares