
A nonparametric multiple imputation approach for missing categorical data

BACKGROUND: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. METHODS: We propose a nearest-neighbour multiple imputation approac...


Bibliographic Details
Main authors:	Zhou, Muhan; He, Yulei; Yu, Mandi; Hsu, Chiu-Hsieh
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5461637/ https://www.ncbi.nlm.nih.gov/pubmed/28587662 http://dx.doi.org/10.1186/s12874-017-0360-2

author	Zhou, Muhan He, Yulei Yu, Mandi Hsu, Chiu-Hsieh
author_sort	Zhou, Muhan
collection	PubMed
description	BACKGROUND: Incomplete categorical variables with more than two categories are common in public health data. However, most of the existing missing-data methods do not use the information from nonresponse (missingness) probabilities. METHODS: We propose a nearest-neighbour multiple imputation approach to impute a missing at random categorical outcome and to estimate the proportion of each category. The donor set for imputation is formed by measuring distances between each missing value with other non-missing values. The distance function is calculated based on a predictive score, which is derived from two working models: one fits a multinomial logistic regression for predicting the missing categorical outcome (the outcome model) and the other fits a logistic regression for predicting missingness probabilities (the missingness model). A weighting scheme is used to accommodate contributions from two working models when generating the predictive score. A missing value is imputed by randomly selecting one of the non-missing values with the smallest distances. We conduct a simulation to evaluate the performance of the proposed method and compare it with several alternative methods. A real-data application is also presented. RESULTS: The simulation study suggests that the proposed method performs well when missingness probabilities are not extreme under some misspecifications of the working models. However, the calibration estimator, which is also based on two working models, can be highly unstable when missingness probabilities for some observations are extremely high. In this scenario, the proposed method produces more stable and better estimates. In addition, proper weights need to be chosen to balance the contributions from the two working models and achieve optimal results for the proposed method. 
CONCLUSIONS: We conclude that the proposed multiple imputation method is a reasonable approach to dealing with missing categorical outcome data with more than two levels for assessing the distribution of the outcome. In terms of the choices for the working models, we suggest a multinomial logistic regression for predicting the missing outcome and a binary logistic regression for predicting the missingness probability. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12874-017-0360-2) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5461637
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5461637 2017-06-07 A nonparametric multiple imputation approach for missing categorical data Zhou, Muhan He, Yulei Yu, Mandi Hsu, Chiu-Hsieh BMC Med Res Methodol Research Article BioMed Central 2017-06-06 /pmc/articles/PMC5461637/ /pubmed/28587662 http://dx.doi.org/10.1186/s12874-017-0360-2 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A nonparametric multiple imputation approach for missing categorical data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5461637/ https://www.ncbi.nlm.nih.gov/pubmed/28587662 http://dx.doi.org/10.1186/s12874-017-0360-2
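
The donor-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the predictive scores from the two working models have already been computed and standardized, and it omits model fitting and variance estimation. The function names, the weighted Euclidean form of the distance, the default weights, and the parameters n_donors and n_imputations are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def nn_multiple_impute(outcome_scores, miss_scores, y, w_outcome=0.8,
                       w_miss=0.2, n_donors=3, n_imputations=5, seed=0):
    """Hypothetical sketch of nearest-neighbour multiple imputation.

    outcome_scores[i]: tuple of predictive scores from the outcome model
    miss_scores[i]:    predictive score from the missingness model
    y[i]:              observed category, or None when the outcome is missing
    Returns a list of n_imputations completed copies of y.
    """
    rng = random.Random(seed)
    observed = [i for i, v in enumerate(y) if v is not None]
    missing = [i for i, v in enumerate(y) if v is None]

    def distance(i, j):
        # Assumed form: weighted Euclidean distance on the two scores.
        d_out = sum((a - b) ** 2
                    for a, b in zip(outcome_scores[i], outcome_scores[j]))
        d_mis = (miss_scores[i] - miss_scores[j]) ** 2
        return math.sqrt(w_outcome * d_out + w_miss * d_mis)

    completed = []
    for _ in range(n_imputations):
        filled = list(y)
        for i in missing:
            # Donor set: the observed cases with the smallest distances.
            donors = sorted(observed, key=lambda j: distance(i, j))[:n_donors]
            # Impute by drawing one donor's observed value at random.
            filled[i] = y[rng.choice(donors)]
        completed.append(filled)
    return completed


def category_proportions(completed):
    """Average each category's proportion across the imputed data sets."""
    counts = Counter()
    for filled in completed:
        counts.update(filled)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Toy data (made up): four observed outcomes and two missing ones.
outcome_scores = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1),
                  (0.8, 0.2), (0.15, 0.85), (0.85, 0.15)]
miss_scores = [0.2, 0.3, 0.7, 0.6, 0.25, 0.65]
y = ["a", "a", "b", "b", None, None]

completed = nn_multiple_impute(outcome_scores, miss_scores, y, n_donors=2)
proportions = category_proportions(completed)
print(proportions)
```

Setting w_outcome to 1 and w_miss to 0 in this sketch would match donors on the outcome model alone; intermediate weights let both working models contribute, which is the balance the abstract says must be chosen with care.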


Similar items