
Addressing missing values in routine health information system data: an evaluation of imputation methods using data from the Democratic Republic of the Congo during the COVID-19 pandemic

BACKGROUND: Poor data quality is limiting the use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important component of this data quality issue comes from missing values, where health facilities, for a variety of reasons, fail to re...


Bibliographic Details
Main Authors: Feng, Shuo; Hategeka, Celestin; Grépin, Karen Ann
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8567342/
https://www.ncbi.nlm.nih.gov/pubmed/34736462
http://dx.doi.org/10.1186/s12963-021-00274-z
author Feng, Shuo
Hategeka, Celestin
Grépin, Karen Ann
collection PubMed
description BACKGROUND: Poor data quality is limiting the use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important component of this data quality issue comes from missing values, where health facilities, for a variety of reasons, fail to report to the central system. METHODS: Using data from the health management information system in the Democratic Republic of the Congo and the advent of the COVID-19 pandemic as an illustrative case study, we implemented seven commonly used imputation methods and evaluated their performance in terms of minimizing bias in imputed values and parameter estimates generated through subsequent analytical techniques, namely segmented regression, which is widely used in interrupted time series studies, and pre–post-comparisons through paired Wilcoxon rank-sum tests. We also examined the performance of these imputation methods under different missing mechanisms and tested their stability to changes in the data. RESULTS: For regression analyses, there were no substantial differences found in the coefficient estimates generated from all methods except mean imputation and exclusion and interpolation when the data contained less than 20% missing values. However, as the missing proportion grew, k-NN started to produce biased estimates. Machine learning algorithms, i.e. missForest and k-NN, were also found to lack robustness to small changes in the data or consecutive missingness. On the other hand, multiple imputation methods generated the overall most unbiased estimates and were the most robust to all changes in data. They also produced smaller standard errors than single imputations. For pre–post-comparisons, all methods produced p values less than 0.01, regardless of the amount of missingness introduced, suggesting low sensitivity of Wilcoxon rank-sum tests to the imputation method used. CONCLUSIONS: We recommend the use of multiple imputation in addressing missing values in RHIS datasets and appropriate handling of data structure to minimize imputation standard errors. In cases where necessary computing resources are unavailable for multiple imputation, one may consider seasonal decomposition as the next best method. Mean imputation and exclusion and interpolation, however, always produced biased and misleading results in the subsequent analyses, and thus, their use in the handling of missing values should be discouraged. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12963-021-00274-z.
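The evaluation design described in the abstract (remove known values, impute them, then compare the downstream segmented-regression estimates with those from the complete data) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the series is synthetic, the function names (`make_series`, `mask_at_random`, `impute_mean`, `impute_linear`, `segmented_fit`) are hypothetical, the trend, level change, and noise parameters are arbitrary assumptions, and only two simple single-imputation methods are shown rather than the seven methods evaluated in the study.

```python
import numpy as np

def make_series(n_months=48, break_at=36, seed=0):
    """Synthetic monthly counts with a linear trend and a level drop at `break_at`.
    All parameters are arbitrary assumptions, not values from the study."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_months)
    post = (t >= break_at).astype(float)
    y = 100.0 + 1.5 * t - 20.0 * post + rng.normal(0.0, 3.0, n_months)
    return t, post, y

def mask_at_random(y, proportion, seed=1):
    """Set a random share of interior points to NaN (missing completely at random)."""
    rng = np.random.default_rng(seed)
    y_missing = y.copy()
    interior = np.arange(1, len(y) - 1)  # keep endpoints so interpolation is defined
    k = int(round(proportion * len(y)))
    idx = rng.choice(interior, size=k, replace=False)
    y_missing[idx] = np.nan
    return y_missing

def impute_mean(y_missing):
    """Replace every missing value with the mean of the observed values."""
    filled = y_missing.copy()
    filled[np.isnan(filled)] = np.nanmean(y_missing)
    return filled

def impute_linear(t, y_missing):
    """Replace missing values by linear interpolation between observed neighbours."""
    observed = ~np.isnan(y_missing)
    return np.interp(t, t[observed], y_missing[observed])

def segmented_fit(t, post, y):
    """Ordinary least squares for y = b0 + b1*t + b2*post; returns (b0, b1, b2)."""
    X = np.column_stack([np.ones_like(t, dtype=float), t, post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

t, post, y = make_series()
y_missing = mask_at_random(y, proportion=0.2)
truth = segmented_fit(t, post, y)  # estimates from the complete series

for name, filled in [("mean", impute_mean(y_missing)),
                     ("linear", impute_linear(t, y_missing))]:
    est = segmented_fit(t, post, filled)
    # Difference between the level-change estimate after imputation
    # and the estimate obtained from the complete series.
    print(name, "difference in level-change estimate:", est[2] - truth[2])
```

Repeating the masking step over many random seeds and missing proportions, and averaging the differences, would give a bias estimate per method in the spirit of the comparison the abstract reports.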
format Online
Article
Text
id pubmed-8567342
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8567342 2021-11-04 Popul Health Metr Research BioMed Central 2021-11-04 /pmc/articles/PMC8567342/ /pubmed/34736462 http://dx.doi.org/10.1186/s12963-021-00274-z Text en © The Author(s) 2021
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Addressing missing values in routine health information system data: an evaluation of imputation methods using data from the Democratic Republic of the Congo during the COVID-19 pandemic
topic Research