
Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models

BACKGROUND: Prediction models are used in clinical research to develop rules that accurately predict patient outcomes based on some of the patients' characteristics. They represent a valuable tool in the decision-making process of clinicians and health policy makers, as they enable...


Bibliographic Details
Main Authors:	Blagus, Rok; Lusa, Lara
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4634915/ https://www.ncbi.nlm.nih.gov/pubmed/26537827 http://dx.doi.org/10.1186/s12859-015-0784-9

_version_	1782399438380072960
author	Blagus, Rok Lusa, Lara
author_facet	Blagus, Rok Lusa, Lara
author_sort	Blagus, Rok
collection	PubMed
description	BACKGROUND: Prediction models are used in clinical research to develop rules that accurately predict patient outcomes based on some of the patients' characteristics. They represent a valuable tool in the decision-making process of clinicians and health policy makers, as they enable them to estimate the probability that patients have or will develop a disease, will respond to a treatment, or that their disease will recur. The interest devoted to prediction models in the biomedical community has been growing in the last few years. Often the data used to develop the prediction models are class-imbalanced, as only a few patients experience the event (and therefore belong to the minority class). RESULTS: Prediction models developed using class-imbalanced data tend to achieve sub-optimal predictive accuracy in the minority class. This problem can be diminished by using sampling techniques aimed at balancing the class distribution. These techniques include undersampling, where only a fraction of the majority class samples is retained in the analysis, and oversampling, where new samples from the minority class are generated. The correct assessment of how the prediction model is likely to perform on independent data is of crucial importance; in the absence of an independent data set, cross-validation is normally used. While the importance of correct cross-validation is well documented in the biomedical literature, the challenges posed by the joint use of sampling techniques and cross-validation have not been addressed. CONCLUSIONS: We show that care must be taken to ensure that cross-validation is performed correctly on sampled data, and that the risk of overestimating the predictive accuracy is greater when oversampling techniques are used. Examples based on the re-analysis of real datasets and simulation studies are provided. We identify some results from the biomedical literature in which cross-validation was performed incorrectly and for which we expect that the performance of oversampling techniques was heavily overestimated. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0784-9) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4634915
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4634915 2015-11-06 Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models Blagus, Rok Lusa, Lara BMC Bioinformatics Research Article BioMed Central 2015-11-04 /pmc/articles/PMC4634915/ /pubmed/26537827 http://dx.doi.org/10.1186/s12859-015-0784-9 Text en © Blagus and Lusa. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Blagus, Rok Lusa, Lara Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models
title	Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models
title_sort	joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4634915/ https://www.ncbi.nlm.nih.gov/pubmed/26537827 http://dx.doi.org/10.1186/s12859-015-0784-9
work_keys_str_mv	AT blagusrok jointuseofoverandundersamplingtechniquesandcrossvalidationforthedevelopmentandassessmentofpredictionmodels AT lusalara jointuseofoverandundersamplingtechniquesandcrossvalidationforthedevelopmentandassessmentofpredictionmodels
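
The pitfall described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the data are pure noise generated here, the 1-nearest-neighbour classifier and random oversampling by duplication are stand-ins chosen for simplicity, and every function name (`make_noise_data`, `oversample`, `minority_recall`, and so on) is hypothetical. The sketch compares oversampling before the folds are formed (copies of one minority sample can land in both training and test folds) with oversampling inside each training fold only.

```python
import random

def make_noise_data(n_major, n_minor, n_features, rng):
    """Pure-noise features: the class label carries no real signal."""
    n = n_major + n_minor
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    y = [0] * n_major + [1] * n_minor
    return X, y

def oversample(X, y, rng):
    """Random oversampling: duplicate minority rows (label 1) until balanced."""
    minority = [i for i, label in enumerate(y) if label == 1]
    if not minority:  # nothing to duplicate in this fold
        return X, y
    n_extra = (len(y) - len(minority)) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

def predict_1nn(X_train, y_train, x):
    """Label of the closest training point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

def folds(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k test folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def minority_recall(X, y, k, rng, oversample_inside):
    """Cross-validated fraction of minority test samples predicted as minority."""
    hits = total = 0
    for test_idx in folds(len(y), k, rng):
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(len(y)) if i not in held_out]
        y_tr = [y[i] for i in range(len(y)) if i not in held_out]
        if oversample_inside:
            # Correct order: balance only the training part of each fold.
            X_tr, y_tr = oversample(X_tr, y_tr, rng)
        for i in test_idx:
            if y[i] == 1:
                total += 1
                hits += predict_1nn(X_tr, y_tr, X[i]) == 1
    return hits / total

rng = random.Random(0)
X, y = make_noise_data(n_major=90, n_minor=10, n_features=5, rng=rng)

# Incorrect order: oversample the whole data set, then cross-validate.
X_bal, y_bal = oversample(X, y, rng)
naive = minority_recall(X_bal, y_bal, 5, rng, oversample_inside=False)

# Correct order: split first, oversample within each training fold.
honest = minority_recall(X, y, 5, rng, oversample_inside=True)

print(f"minority recall, oversampling before CV: {naive:.2f}")
print(f"minority recall, oversampling inside CV: {honest:.2f}")
```

Because the features are noise, any apparent skill on the minority class is an artefact. In the first procedure a duplicated minority sample in the test fold usually has an identical copy in the training fold, so the nearest-neighbour rule classifies it correctly at distance zero; the second procedure removes that leak.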
