
The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model

Adequate imputation of missing data can preserve statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a useful tool for inferring missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) a...


Bibliographic Details
Main Authors: Guo, Chao-Yu, Yang, Ying-Chen, Chen, Yi-Hau
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2021
Subjects: Public Health
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8289437/
https://www.ncbi.nlm.nih.gov/pubmed/34291028
http://dx.doi.org/10.3389/fpubh.2021.680054
author Guo, Chao-Yu
Yang, Ying-Chen
Chen, Yi-Hau
collection PubMed
description Adequate imputation of missing data can preserve statistical power and avoid erroneous conclusions. In the era of big data, machine learning is a useful tool for inferring missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, how the Cox proportional hazards model behaves under different types of imputation requires careful study, and the validity of the analysis under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios, varying parameters such as sample size, missing rate, and missing mechanism. The results show the type-I errors obtained with different imputation techniques in survival data. The simulation results show that the non-parametric “missForest” based on the unsupervised imputation is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods do not yield valid tests when the missing pattern is informative. Improperly conducted statistical analysis of data with missing values may lead to erroneous conclusions. This research provides a clear guideline for a valid survival analysis using the Cox proportional hazard model with machine learning-based imputations.
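The two accuracy statistics named in the description can be illustrated with a minimal sketch. It assumes the true values of the masked entries are known, as they would be in a simulation; the function names `rmse` and `pfc` and the toy inputs are hypothetical and are not taken from the article. RMSE applies to continuous variables and PFC to categorical ones.

```python
import numpy as np

def rmse(true_values, imputed_values):
    """Root mean square error over the imputed entries (continuous variables)."""
    t = np.asarray(true_values, dtype=float)
    i = np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean((t - i) ** 2)))

def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries (categorical variables)."""
    t = np.asarray(true_labels)
    i = np.asarray(imputed_labels)
    return float(np.mean(t != i))

# Toy example (made-up numbers): one of three continuous values is off by 2,
# and one of four categorical labels is wrong.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(pfc(["a", "b", "a", "b"], ["a", "a", "a", "b"]))
```

Lower values indicate more accurate imputation for both statistics. Neither one measures whether a downstream test keeps its nominal type-I error, which is the question the description says the study examines.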
format Online
Article
Text
id pubmed-8289437
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8289437 2021-07-20 The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model Guo, Chao-Yu Yang, Ying-Chen Chen, Yi-Hau Front Public Health Public Health Frontiers Media S.A. 2021-07-05 /pmc/articles/PMC8289437/ /pubmed/34291028 http://dx.doi.org/10.3389/fpubh.2021.680054 Text en Copyright © 2021 Guo, Yang and Chen. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).
The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model
topic Public Health
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8289437/
https://www.ncbi.nlm.nih.gov/pubmed/34291028
http://dx.doi.org/10.3389/fpubh.2021.680054