A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods
The problem of missing data, particularly for dichotomous variables, is a common issue in medical research. However, few studies have focused on the imputation methods of dichotomous data and their performance, as well as the applicability of these imputation methods and the factors that may affect...
Main authors: | Ge, Yingfeng; Li, Zhiwei; Zhang, Jinxin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10256703/ https://www.ncbi.nlm.nih.gov/pubmed/37296269 http://dx.doi.org/10.1038/s41598-023-36509-2 |
_version_ | 1785057162961092608 |
---|---|
author | Ge, Yingfeng Li, Zhiwei Zhang, Jinxin |
author_facet | Ge, Yingfeng Li, Zhiwei Zhang, Jinxin |
author_sort | Ge, Yingfeng |
collection | PubMed |
description | The problem of missing data, particularly for dichotomous variables, is a common issue in medical research. However, few studies have focused on the imputation methods of dichotomous data and their performance, as well as the applicability of these imputation methods and the factors that may affect their performance. In the arrangement of application scenarios, different missing mechanisms, sample sizes, missing rates, the correlation between variables, value distributions, and the number of missing variables were considered. We used data simulation techniques to establish a variety of different compound scenarios for missing dichotomous variables and conducted real-data validation on two real-world medical datasets. We comprehensively compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario. Accuracy and mean absolute error (MAE) were applied to evaluating their performance. The results showed that missing mechanisms, value distributions and the correlation between variables were the main factors affecting the performance of imputation methods. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and were of potential applicability. Researchers should explore the correlation between variables and their distribution pattern in advance and prioritize machine learning-based methods for practical applications when encountering dichotomous missing data. |
format | Online Article Text |
id | pubmed-10256703 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-102567032023-06-11 A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods Ge, Yingfeng Li, Zhiwei Zhang, Jinxin Sci Rep Article The problem of missing data, particularly for dichotomous variables, is a common issue in medical research. However, few studies have focused on the imputation methods of dichotomous data and their performance, as well as the applicability of these imputation methods and the factors that may affect their performance. In the arrangement of application scenarios, different missing mechanisms, sample sizes, missing rates, the correlation between variables, value distributions, and the number of missing variables were considered. We used data simulation techniques to establish a variety of different compound scenarios for missing dichotomous variables and conducted real-data validation on two real-world medical datasets. We comprehensively compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario. Accuracy and mean absolute error (MAE) were applied to evaluating their performance. The results showed that missing mechanisms, value distributions and the correlation between variables were the main factors affecting the performance of imputation methods. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and were of potential applicability. Researchers should explore the correlation between variables and their distribution pattern in advance and prioritize machine learning-based methods for practical applications when encountering dichotomous missing data. 
Nature Publishing Group UK 2023-06-09 /pmc/articles/PMC10256703/ /pubmed/37296269 http://dx.doi.org/10.1038/s41598-023-36509-2 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Ge, Yingfeng Li, Zhiwei Zhang, Jinxin A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods |
title | A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods |
title_full | A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods |
title_fullStr | A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods |
title_full_unstemmed | A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods |
title_short | A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods |
title_sort | simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10256703/ https://www.ncbi.nlm.nih.gov/pubmed/37296269 http://dx.doi.org/10.1038/s41598-023-36509-2 |
work_keys_str_mv | AT geyingfeng asimulationstudyonmissingdataimputationfordichotomousvariablesusingstatisticalandmachinelearningmethods AT lizhiwei asimulationstudyonmissingdataimputationfordichotomousvariablesusingstatisticalandmachinelearningmethods AT zhangjinxin asimulationstudyonmissingdataimputationfordichotomousvariablesusingstatisticalandmachinelearningmethods AT geyingfeng simulationstudyonmissingdataimputationfordichotomousvariablesusingstatisticalandmachinelearningmethods AT lizhiwei simulationstudyonmissingdataimputationfordichotomousvariablesusingstatisticalandmachinelearningmethods AT zhangjinxin simulationstudyonmissingdataimputationfordichotomousvariablesusingstatisticalandmachinelearningmethods |