
A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods

BACKGROUND: For the development of prognostic models, after multiple imputation, variable selection is advised to be applied from the pooled model. The aim of this study is to evaluate by using a simulation study and practical data example the performance of four different pooling methods for variab...


Bibliographic Details
Main Authors:	Panken, A. M., Heymans, M. W.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2022
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9351113/ https://www.ncbi.nlm.nih.gov/pubmed/35927610 http://dx.doi.org/10.1186/s12874-022-01693-8

_version_	1784762369376780288
author	Panken, A. M. Heymans, M. W.
author_facet	Panken, A. M. Heymans, M. W.
author_sort	Panken, A. M.
collection	PubMed
description	BACKGROUND: When developing prognostic models after multiple imputation, it is advised to apply variable selection to the pooled model. The aim of this study is to evaluate, using a simulation study and a practical data example, the performance of four pooling methods for variable selection in multiply imputed datasets. These methods are D1, D2, D3 and the recently extended Median-P-Rule (MPR) for categorical, dichotomous, and continuous variables in logistic regression models. METHODS: Four datasets (n = 200 and n = 500), with 9 variables and correlations of 0.2 and 0.6, respectively, between these variables, were simulated. These datasets included 2 categorical and 2 continuous variables with 20% missing-at-random data. Multiple Imputation (m = 5) was applied, and the four methods were compared with selection from the full model (without missing data). The same analyses were repeated in five multiply imputed real-world datasets (NHANES) (m = 5, p = 0.05, N = 250/300/400/500/1000). RESULTS: In the simulated datasets, the differences between the pooling methods were most evident in the smaller datasets. The MPR performed equally to all other pooling methods for the selection frequency, as well as for the P-values of the continuous and dichotomous variables. However, the MPR performed consistently better for pooling and selecting categorical variables in multiply imputed datasets, and also regarding the stability of the selected prognostic models. Analyses in the NHANES dataset showed that all methods mostly selected the same models. Compared to each other, however, the D2 method seemed to be the least sensitive and the MPR the most sensitive, as well as the simplest and easiest method to apply. 
CONCLUSIONS: Considering that the MPR is the simplest and easiest pooling method for epidemiologists and applied researchers to use, we cautiously recommend using the MPR method to pool categorical variables with more than two levels after Multiple Imputation in combination with Backward Selection procedures (BWS). Because the MPR never performed worse than the other methods for continuous and dichotomous variables, we also advise using the MPR for these types of variables.
format	Online Article Text
id	pubmed-9351113
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-9351113 2022-08-05 A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods Panken, A. M. Heymans, M. W. BMC Med Res Methodol Research BioMed Central 2022-08-04 /pmc/articles/PMC9351113/ /pubmed/35927610 http://dx.doi.org/10.1186/s12874-022-01693-8 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Panken, A. M. Heymans, M. W. A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods
title	A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods
title_full	A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods
title_fullStr	A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods
title_full_unstemmed	A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods
title_short	A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods
title_sort	simple pooling method for variable selection in multiply imputed datasets outperformed complex methods
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9351113/ https://www.ncbi.nlm.nih.gov/pubmed/35927610 http://dx.doi.org/10.1186/s12874-022-01693-8
work_keys_str_mv	AT pankenam asimplepoolingmethodforvariableselectioninmultiplyimputeddatasetsoutperformedcomplexmethods AT heymansmw asimplepoolingmethodforvariableselectioninmultiplyimputeddatasetsoutperformedcomplexmethods AT pankenam simplepoolingmethodforvariableselectioninmultiplyimputeddatasetsoutperformedcomplexmethods AT heymansmw simplepoolingmethodforvariableselectioninmultiplyimputeddatasetsoutperformedcomplexmethods
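
As an illustration of the Median-P-Rule (MPR) named in the abstract, the sketch below pools the P-values that one candidate variable receives in each imputed dataset by taking their median and compares the result with the selection threshold. It is a minimal sketch, not the authors' implementation: the function name `median_p_rule`, the example P-values, and the "retain if pooled P < alpha" decision are assumptions made for illustration; only m = 5 and p = 0.05 are taken from the abstract.

```python
from statistics import median


def median_p_rule(p_values, alpha=0.05):
    """Pool per-imputation P-values for one variable by their median.

    p_values: one P-value per imputed dataset (hypothetical input).
    alpha: selection threshold; 0.05 mirrors the value in the abstract.
    Returns (pooled P-value, True if the variable would be retained).
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    pooled = median(p_values)
    # Assumed decision rule: retain the variable when the pooled P < alpha.
    return pooled, pooled < alpha


# Hypothetical P-values for one variable across m = 5 imputed datasets.
example_p_values = [0.03, 0.07, 0.04, 0.02, 0.06]
pooled_p, retained = median_p_rule(example_p_values)
print(pooled_p, retained)  # prints: 0.04 True
```

In a backward selection loop, a rule like this would presumably be applied to every candidate variable at each step, with the variable that has the largest pooled P-value above the threshold removed before the model is refitted in all imputed datasets.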


Similar Items