
Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can acc...


Bibliographic Details
Main Authors:	Shah, Anoop D., Bartlett, Jonathan W., Carpenter, James, Nicholas, Owen, Hemingway, Harry
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2014
Subjects:	Practice of Epidemiology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3939843/ https://www.ncbi.nlm.nih.gov/pubmed/24589914 http://dx.doi.org/10.1093/aje/kwt312

author	Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry
author_sort	Shah, Anoop D.
collection	PubMed
description	Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
format	Online Article Text
id	pubmed-3939843
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3939843 2014-03-04 Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study. Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry. Am J Epidemiol. Practice of Epidemiology. Oxford University Press 2014-03-15 2014-01-12 /pmc/articles/PMC3939843/ /pubmed/24589914 http://dx.doi.org/10.1093/aje/kwt312 Text en © The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study
topic	Practice of Epidemiology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3939843/ https://www.ncbi.nlm.nih.gov/pubmed/24589914 http://dx.doi.org/10.1093/aje/kwt312
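
The abstract's second simulation study concerns a partially observed variable that depends on fully observed variables in a nonlinear way. The sketch below is a toy illustration of that idea only, not the authors' method: it uses a single deterministic imputation rather than MICE, and a nearest-donor rule as a simple nonparametric stand-in for random forest. The data-generating model (y = x² plus noise), the missingness probabilities, and all function names are assumptions made for illustration.

```python
import random
from bisect import bisect_left
from statistics import mean


def simulate(n=2000, seed=0):
    """Hypothetical data: y depends on x quadratically; y is missing at random given x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        y = x * x + rng.gauss(0.0, 0.5)
        # Missingness depends only on the fully observed x (missing at random).
        missing = rng.random() < (0.8 if x > 0 else 0.1)
        rows.append((x, y, missing))
    return rows


def linear_impute(rows):
    """Impute missing y from a straight-line fit of y on x (a misspecified model here)."""
    obs = [(x, y) for x, y, m in rows if not m]
    mx = mean(x for x, _ in obs)
    my = mean(y for _, y in obs)
    slope = sum((x - mx) * (y - my) for x, y in obs) / sum((x - mx) ** 2 for x, _ in obs)
    intercept = my - slope * mx
    return [intercept + slope * x if m else y for x, y, m in rows]


def nearest_donor_impute(rows):
    """Impute missing y with the y of the observed record whose x is closest."""
    obs = sorted((x, y) for x, y, m in rows if not m)
    xs = [x for x, _ in obs]
    completed = []
    for x, y, m in rows:
        if not m:
            completed.append(y)
            continue
        i = bisect_left(xs, x)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(obs)]
        j = min(candidates, key=lambda k: abs(xs[k] - x))
        completed.append(obs[j][1])
    return completed


rows = simulate()
full_mean = mean(y for _, y, _ in rows)  # mean of y before any values were deleted
lin_err = abs(mean(linear_impute(rows)) - full_mean)
nn_err = abs(mean(nearest_donor_impute(rows)) - full_mean)
print(f"linear imputation error:        {lin_err:.3f}")
print(f"nearest-donor imputation error: {nn_err:.3f}")
```

In this toy setting the straight-line model cannot capture the quadratic relationship, so its imputed values pull the estimated mean away from the full-data mean, while the donor-based rule follows the local shape of the data. A proper comparison would draw multiple imputations and examine confidence interval coverage, as the study describes.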

