
To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets

BACKGROUND: For finite samples with binary outcomes, penalized logistic regression such as ridge logistic regression has the potential to achieve smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that ridge logistic reg...


Bibliographic Details
Main authors:	Šinkovec, Hana, Heinze, Georg, Blagus, Rok, Geroldinger, Angelika
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8482588/ https://www.ncbi.nlm.nih.gov/pubmed/34592945 http://dx.doi.org/10.1186/s12874-021-01374-y

_version_	1784576939955060736
author	Šinkovec, Hana Heinze, Georg Blagus, Rok Geroldinger, Angelika
author_facet	Šinkovec, Hana Heinze, Georg Blagus, Rok Geroldinger, Angelika
author_sort	Šinkovec, Hana
collection	PubMed
description	BACKGROUND: For finite samples with binary outcomes, penalized logistic regression such as ridge logistic regression has the potential to achieve smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that ridge logistic regression can result in highly variable calibration slopes in small or sparse data situations. METHODS: In this paper, we elaborate on this issue by performing a comprehensive simulation study, investigating the performance of ridge logistic regression in terms of coefficients and predictions and comparing it to Firth’s correction, which has been shown to perform well in low-dimensional settings. In addition to tuned ridge regression, where the penalty strength is estimated from the data by minimizing some measure of the out-of-sample prediction error or an information criterion, we also considered ridge regression with a pre-specified degree of shrinkage. We included ‘oracle’ models in the simulation study, in which the complexity parameter was chosen based on the true event probabilities (prediction oracle) or regression coefficients (explanation oracle), to demonstrate the capability of ridge regression if the truth were known. RESULTS: Performance of ridge regression strongly depends on the choice of the complexity parameter. As shown in our simulation and illustrated by a data example, values optimized in small or sparse datasets are negatively correlated with optimal values and suffer from substantial variability, which translates into large MSE of coefficients and large variability of calibration slopes. In contrast, in our simulations, pre-specifying the degree of shrinkage prior to fitting led to accurate coefficients and predictions even in non-ideal settings such as those encountered in the context of rare outcomes or sparse predictors. CONCLUSIONS: Applying tuned ridge regression in small or sparse datasets is problematic, as it results in unstable coefficients and predictions. In contrast, determining the degree of shrinkage according to some meaningful prior assumptions about true effects has the potential to reduce bias and stabilize the estimates. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12874-021-01374-y.
format	Online Article Text
id	pubmed-8482588
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8482588 2021-10-04 To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets Šinkovec, Hana Heinze, Georg Blagus, Rok Geroldinger, Angelika BMC Med Res Methodol Research BioMed Central 2021-09-30 /pmc/articles/PMC8482588/ /pubmed/34592945 http://dx.doi.org/10.1186/s12874-021-01374-y Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8482588/ https://www.ncbi.nlm.nih.gov/pubmed/34592945 http://dx.doi.org/10.1186/s12874-021-01374-y
work_keys_str_mv	AT sinkovechana totuneornottotuneacasestudyofridgelogisticregressioninsmallorsparsedatasets AT heinzegeorg totuneornottotuneacasestudyofridgelogisticregressioninsmallorsparsedatasets AT blagusrok totuneornottotuneacasestudyofridgelogisticregressioninsmallorsparsedatasets AT geroldingerangelika totuneornottotuneacasestudyofridgelogisticregressioninsmallorsparsedatasets
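
The abstract above contrasts tuned ridge logistic regression, where the penalty is chosen from the data, with ridge regression at a pre-specified degree of shrinkage. The following is a minimal sketch of that contrast and not the authors' implementation. It assumes a single predictor with an unpenalized intercept, a toy dataset, leave-one-out cross-validated deviance as the tuning criterion, and a hypothetical pre-specified penalty LAM_FIXED; none of these choices are taken from the article.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(x, y, lam, n_iter=100, tol=1e-10):
    """Newton-Raphson fit of logit(p) = b0 + b1*x, maximising
    loglik - (lam / 2) * b1**2. The intercept b0 is not penalised."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = 0.0
        g1 = -lam * b1          # penalty contribution to the score
        h00 = h01 = 0.0
        h11 = lam               # penalty contribution to the information
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:         # near-singular information: stop updating
            break
        # Solve the 2x2 Newton system analytically.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def loo_deviance(x, y, lam):
    """Leave-one-out cross-validated deviance for a given penalty."""
    dev = 0.0
    for i in range(len(x)):
        b0, b1 = fit_ridge_logistic(x[:i] + x[i + 1:], y[:i] + y[i + 1:], lam)
        p = min(max(sigmoid(b0 + b1 * x[i]), 1e-12), 1.0 - 1e-12)
        dev += -2.0 * (y[i] * math.log(p) + (1 - y[i]) * math.log(1.0 - p))
    return dev


def tune_lambda(x, y, grid):
    """Pick the grid value with the smallest leave-one-out deviance."""
    return min(grid, key=lambda lam: loo_deviance(x, y, lam))


# Toy data (illustrative only, not from the article).
X = [-2, -1, -1, 0, 0, 0, 1, 1, 2, 2]
Y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
GRID = [0.01, 0.1, 1.0, 10.0, 100.0]
LAM_FIXED = 1.0  # hypothetical pre-specified penalty

fixed_fit = fit_ridge_logistic(X, Y, LAM_FIXED)
tuned_lam = tune_lambda(X, Y, GRID)
tuned_fit = fit_ridge_logistic(X, Y, tuned_lam)

print("pre-specified lambda:", LAM_FIXED, "coefficients:", fixed_fit)
print("tuned lambda:", tuned_lam, "coefficients:", tuned_fit)
```

With only ten observations, the tuned penalty here rests on very little information. Refitting on resampled or perturbed copies of such a small dataset is one way to see how much the selected value can move, which is the kind of instability the abstract reports for small or sparse data.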
