Correcting for Optimistic Prediction in Small Data Sets


Bibliographic Details
Main authors:	Smith, Gordon C. S.; Seaman, Shaun R.; Wood, Angela M.; Royston, Patrick; White, Ian R.
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2014
Subjects:	Practice of Epidemiology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4108045/ https://www.ncbi.nlm.nih.gov/pubmed/24966219 http://dx.doi.org/10.1093/aje/kwu140

collection	PubMed
description	The C statistic is a commonly reported measure of screening test performance. Optimistic estimation of the C statistic is a frequent problem because of overfitting of statistical models in small data sets, and methods exist to correct for this issue. However, many studies do not use such methods, and those that do correct for optimism use diverse methods, some of which are known to be biased. We used clinical data sets (United Kingdom Down syndrome screening data from Glasgow (1991–2003), Edinburgh (1999–2003), and Cambridge (1990–2006), as well as Scottish national pregnancy discharge data (2004–2007)) to evaluate different approaches to adjustment for optimism. We found that sample splitting, cross-validation without replication, and leave-1-out cross-validation produced optimism-adjusted estimates of the C statistic that were biased and/or associated with greater absolute error than other available methods. Cross-validation with replication, bootstrapping, and a new method (leave-pair-out cross-validation) all generated unbiased optimism-adjusted estimates of the C statistic and had similar absolute errors in the clinical data set. Larger simulation studies confirmed that all 3 methods performed similarly with 10 or more events per variable, or when the C statistic was 0.9 or greater. However, with lower events per variable or lower C statistics, bootstrapping tended to be optimistic but with lower absolute and mean squared errors than both methods of cross-validation.
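The description names bootstrapping and leave-pair-out cross-validation but this record gives no implementation details. The following is a minimal sketch, not the authors' code. It assumes the commonly used optimism bootstrap (apparent C statistic minus the average optimism across resamples) and a pairwise leave-out definition in which each event/non-event pair is held out in turn. The toy model in `fit`, the simulated data in `simulate`, and all function names are hypothetical stand-ins chosen for illustration.

```python
import random
from itertools import product


def c_statistic(scores, labels):
    """Share of event/non-event pairs where the event scores higher (ties count 0.5)."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    nonevents = [s for s, y in zip(scores, labels) if y == 0]
    concordant = sum(1.0 if e > n else 0.5 if e == n else 0.0
                     for e, n in product(events, nonevents))
    return concordant / (len(events) * len(nonevents))


def fit(X, y):
    """Hypothetical toy model: weight = difference in class means per variable."""
    ev = [x for x, yi in zip(X, y) if yi == 1]
    ne = [x for x, yi in zip(X, y) if yi == 0]
    return [sum(x[j] for x in ev) / len(ev) - sum(x[j] for x in ne) / len(ne)
            for j in range(len(X[0]))]


def predict(w, X):
    """Linear score for each row of X."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in X]


def bootstrap_corrected_c(X, y, n_boot=200, seed=0):
    """Return (apparent C, optimism-corrected C) using the optimism bootstrap."""
    rng = random.Random(seed)
    n = len(y)
    apparent = c_statistic(predict(fit(X, y), X), y)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if sum(yb) in (0, n):  # resample lacks one class, so redraw
            continue
        Xb = [X[i] for i in idx]
        w = fit(Xb, yb)
        # Optimism: performance on the resample minus performance on the original data.
        optimism.append(c_statistic(predict(w, Xb), yb) - c_statistic(predict(w, X), y))
    return apparent, apparent - sum(optimism) / n_boot


def leave_pair_out_c(X, y):
    """Hold out each event/non-event pair, refit, and check whether the event ranks higher."""
    ev = [i for i, yi in enumerate(y) if yi == 1]
    ne = [i for i, yi in enumerate(y) if yi == 0]
    total = 0.0
    for i, j in product(ev, ne):
        keep = [k for k in range(len(y)) if k not in (i, j)]
        w = fit([X[k] for k in keep], [y[k] for k in keep])
        si, sj = predict(w, [X[i], X[j]])
        total += 1.0 if si > sj else 0.5 if si == sj else 0.0
    return total / (len(ev) * len(ne))


def simulate(n_events=15, n_nonevents=25, n_vars=5, seed=1):
    """Simulated data (an assumption): one informative variable, the rest pure noise."""
    rng = random.Random(seed)
    y = [1] * n_events + [0] * n_nonevents
    X = [[rng.gauss(float(yi), 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(n_vars - 1)]
         for yi in y]
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    apparent, corrected = bootstrap_corrected_c(X, y)
    print("apparent C:", round(apparent, 3))
    print("bootstrap-corrected C:", round(corrected, 3))
    print("leave-pair-out C:", round(leave_pair_out_c(X, y), 3))
```

Leave-pair-out needs at least two events and two non-events so that a model can still be fitted after a pair is removed. The simulated data fix the class counts for that reason.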
id	pubmed-4108045
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-4108045 2014-07-25 Am J Epidemiol Oxford University Press 2014-08-01 2014-06-24 /pmc/articles/PMC4108045/ /pubmed/24966219 http://dx.doi.org/10.1093/aje/kwu140 Text en © The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
