
Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study


Bibliographic Details
Main authors: Marshall, Andrea, Altman, Douglas G, Royston, Patrick, Holder, Roger L
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2824146/
https://www.ncbi.nlm.nih.gov/pubmed/20085642
http://dx.doi.org/10.1186/1471-2288-10-7
author Marshall, Andrea
Altman, Douglas G
Royston, Patrick
Holder, Roger L
collection PubMed
description BACKGROUND: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore, a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model. METHODS: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained. RESULTS: Performing a CC analysis produced unbiased regression estimates but inflated the standard errors, which affected the significance of the covariates in the model with 25% or more missingness. SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches. CONCLUSION: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
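The three missingness mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' simulation code (the study used R, and the abstract does not give the generating models). The two covariates, the log-normal distribution, the logistic missingness model and the function names `simulate_covariates` and `impose_missing` are all assumptions made for illustration.

```python
import math
import random


def simulate_covariates(n, seed=0):
    """Return n rows of (x_obs, x_skew); both distributions are illustrative assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x_obs = rng.gauss(0.0, 1.0)            # covariate that stays fully observed
        x_skew = rng.lognormvariate(0.0, 1.0)  # skewed covariate that will lose values
        rows.append((x_obs, x_skew))
    return rows


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def impose_missing(rows, mechanism, base_rate=0.25, seed=1):
    """Set x_skew to None with a probability that depends on the mechanism.

    MCAR: constant probability, unrelated to any data.
    MAR:  probability depends only on the fully observed covariate x_obs.
    MNAR: probability depends on the value of x_skew that goes missing.
    """
    rng = random.Random(seed)
    intercept = math.log(base_rate / (1.0 - base_rate))
    out = []
    for x_obs, x_skew in rows:
        if mechanism == "MCAR":
            p = base_rate
        elif mechanism == "MAR":
            p = logistic(intercept + x_obs)
        elif mechanism == "MNAR":
            p = logistic(intercept + math.log(x_skew))
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((x_obs, None if rng.random() < p else x_skew))
    return out


if __name__ == "__main__":
    data = simulate_covariates(2000, seed=42)
    for mech in ("MCAR", "MAR", "MNAR"):
        incomplete = impose_missing(data, mech)
        frac = sum(v is None for _, v in incomplete) / len(incomplete)
        print(mech, "fraction of incomplete cases:", round(frac, 3))
```

Under MNAR in this sketch, larger values of the skewed covariate are more likely to be missing, so the observed values are no longer representative of the full distribution. That is the situation in which the abstract reports biased estimates for all MI approaches.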
format Text
id pubmed-2824146
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2824146 2010-02-19 Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. Marshall, Andrea; Altman, Douglas G; Royston, Patrick; Holder, Roger L. BMC Med Res Methodol. Research Article. BioMed Central 2010-01-19. http://dx.doi.org/10.1186/1471-2288-10-7 Text, en. Copyright ©2010 Marshall et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
topic Research Article