Comparison of methods for imputing limited-range variables: a simulation study

Bibliographic Details
Main authors: Rodwell, Laura; Lee, Katherine J; Romaniuk, Helena; Carlin, John B
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021274/
https://www.ncbi.nlm.nih.gov/pubmed/24766825
http://dx.doi.org/10.1186/1471-2288-14-57
author Rodwell, Laura
Lee, Katherine J
Romaniuk, Helena
Carlin, John B
collection PubMed
description BACKGROUND: Multiple imputation (MI) was developed as a method to enable valid inferences to be obtained in the presence of missing data rather than to re-create the missing values. Within the applied setting, it remains unclear how important it is that imputed values should be plausible for individual observations. One variable type for which MI may lead to implausible values is a limited-range variable, where imputed values may fall outside the observable range. The aim of this work was to compare methods for imputing limited-range variables, with a focus on those that restrict the range of the imputed values.
METHODS: Using data from a study of adolescent health, we consider three variables based on responses to the General Health Questionnaire (GHQ), a tool for detecting minor psychiatric illness. These variables, based on different scoring methods for the GHQ, resulted in three continuous distributions with mild, moderate and severe positive skewness. In an otherwise complete dataset, we set 33% of the GHQ observations to missing completely at random or missing at random, repeating this process to create 1000 datasets with incomplete data for each scenario. For each dataset, we imputed values on the raw scale and following a zero-skewness log transformation using: univariate regression with no rounding; post-imputation rounding; truncated normal regression; and predictive mean matching. We estimated the marginal mean of the GHQ and the association between the GHQ and a fully observed binary outcome, comparing the results with complete data statistics.
RESULTS: Imputation with no rounding performed well when applied to data on the raw scale. Post-imputation rounding and imputation using truncated normal regression produced higher marginal means than the complete data estimate when data had a moderate or severe skew, and this was associated with under-coverage of the complete data estimate. Predictive mean matching also produced under-coverage of the complete data estimate. For the estimate of association, all methods produced similar estimates to the complete data.
CONCLUSIONS: For data with a limited range, multiple imputation using techniques that restrict the range of imputed values can result in biased estimates for the marginal mean when data are highly skewed.
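The comparison described in the abstract can be illustrated with a small sketch. This is not the authors' code or data: the simulated score, the covariate `x`, the limits 0 and 12, the sample size and all function names are assumptions made for illustration. It performs a single stochastic regression imputation rather than full multiple imputation, and it takes "post-imputation rounding" to mean moving out-of-range imputations to the nearest limit.

```python
import random
import statistics


def make_data(n, rng, lower=0.0, upper=12.0):
    """Return (x, y) pairs where y is positively skewed and confined to [lower, upper]."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.expovariate(0.5) + 0.5 * x  # skewed score loosely related to x
        pairs.append((x, min(upper, max(lower, y))))
    return pairs


def set_mcar(pairs, rng, prop=0.33):
    """Set each y to None with probability prop, independently of the data (MCAR)."""
    return [(x, None if rng.random() < prop else y) for x, y in pairs]


def fit_line(observed):
    """Least-squares fit of y on x; returns intercept, slope and residual SD."""
    xs = [x for x, _ in observed]
    ys = [y for _, y in observed]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in observed) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in observed)
    return intercept, slope, (rss / (len(observed) - 2)) ** 0.5


def impute(incomplete, rng, lower=None, upper=None):
    """Fill missing y with a regression prediction plus normal noise.

    If limits are given, out-of-range draws are moved to the nearest limit.
    """
    observed = [(x, y) for x, y in incomplete if y is not None]
    a, b, sd = fit_line(observed)
    completed = []
    for x, y in incomplete:
        if y is None:
            y = a + b * x + rng.gauss(0.0, sd)
            if lower is not None:
                y = max(lower, y)
            if upper is not None:
                y = min(upper, y)
        completed.append(y)
    return completed


rng = random.Random(2014)
full = make_data(2000, rng)
incomplete = set_mcar(full, rng)

# Same seed for both runs, so the two completed datasets differ only
# where an imputed value was moved to a limit.
unrestricted = impute(incomplete, random.Random(1))
restricted = impute(incomplete, random.Random(1), lower=0.0, upper=12.0)

print("complete-data mean:", round(statistics.fmean([y for _, y in full]), 3))
print("no rounding:       ", round(statistics.fmean(unrestricted), 3))
print("rounded to limits: ", round(statistics.fmean(restricted), 3))
```

Repeating the last block over many simulated datasets, and replacing the single imputation with several imputations combined by Rubin's rules, would bring the sketch closer to the design summarised above.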
format Online
Article
Text
id pubmed-4021274
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4021274 2014-05-28
Comparison of methods for imputing limited-range variables: a simulation study
Rodwell, Laura
Lee, Katherine J
Romaniuk, Helena
Carlin, John B
BMC Med Res Methodol
Research Article
BioMed Central 2014-04-26
/pmc/articles/PMC4021274/ /pubmed/24766825 http://dx.doi.org/10.1186/1471-2288-14-57
Text en
Copyright © 2014 Rodwell et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Comparison of methods for imputing limited-range variables: a simulation study
topic Research Article