Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach

BACKGROUND: In emergency departments (EDs), early diagnosis and timely rescue, which are supported by prediction models using ED data, can increase patients’ chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early...

Bibliographic Details
Main authors: Chen, Xiaojie, Chen, Han, Nan, Shan, Kong, Xiangtian, Duan, Huilong, Zhu, Haiyan
Format: Online Article Text
Language: English
Published: JMIR Publications 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9898833/
https://www.ncbi.nlm.nih.gov/pubmed/36662548
http://dx.doi.org/10.2196/38590
author Chen, Xiaojie
Chen, Han
Nan, Shan
Kong, Xiangtian
Duan, Huilong
Zhu, Haiyan
collection PubMed
description BACKGROUND: In emergency departments (EDs), early diagnosis and timely rescue, which are supported by prediction models using ED data, can increase patients’ chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early identification models for diseases. OBJECTIVE: This study aims to propose a systematic approach to deal with the problems of missing, imbalanced, and sparse features when developing sudden-death prediction models using emergency medicine (or ED) data. METHODS: We proposed a 3-step approach to deal with data quality issues: a random forest (RF) for missing values, k-means for imbalanced data, and principal component analysis (PCA) for sparse features. The coefficient of determination (R²) and the κ coefficient were used to evaluate imputation performance for continuous and discrete variables, respectively. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to estimate the model’s performance. To further evaluate the proposed approach, we carried out a case study using an ED data set obtained from the Hainan Hospital of Chinese PLA General Hospital. A logistic regression (LR) prediction model for patient condition worsening was built. RESULTS: A total of 1085 patients with rescue records and 17,959 patients without rescue records were selected; the two groups were significantly imbalanced. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnoses, respectively. After data preprocessing, the median R² of the RF imputation of continuous variables was 0.623 (IQR 0.647), and the median κ coefficient of the imputation of discrete variables was 0.444 (IQR 0.285). The LR model constructed using the initial diagnostic data showed poor performance and variable separation, which was reflected in the abnormally high odds ratio (OR) values of the 2 variables of cardiac arrest and respiratory arrest (201568034532 and 1211118945, respectively) and an abnormal 95% CI. Using processed data, the recall of the model reached 0.746, the F1-score was 0.73, and the AUROC was 0.708. CONCLUSIONS: The proposed systematic approach is valid for building a prediction model for emergency patients.
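The abstract scores the imputation step with R² for continuous variables and the κ coefficient for discrete ones. As a minimal sketch of those two metrics, assuming imputed values are compared against held-out known values (the masking protocol, the function names, and the toy data below are illustrative assumptions, not taken from the paper):

```python
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Not defined when y_true is constant (SS_tot would be zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cohen_kappa(labels_true, labels_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement, p_e the agreement expected by chance
    from the marginal label frequencies. Not defined when p_e equals 1.
    """
    n = len(labels_true)
    p_o = sum(t == p for t, p in zip(labels_true, labels_pred)) / n
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical held-out values and their imputed counterparts.
known_lab = [1.0, 2.0, 3.0, 4.0]      # a continuous laboratory value
imputed_lab = [1.0, 2.0, 3.0, 5.0]
known_flag = [1, 1, 0, 0]             # a discrete (binary) variable
imputed_flag = [1, 0, 0, 0]

r2 = r_squared(known_lab, imputed_lab)
kappa = cohen_kappa(known_flag, imputed_flag)
print(f"R^2 = {r2:.3f}, kappa = {kappa:.3f}")
```

Computing one such score per imputed variable and summarizing across variables would yield a median and IQR of the kind the abstract reports.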
format Online
Article
Text
id pubmed-9898833
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-9898833 2023-02-05 JMIR Med Inform Original Paper JMIR Publications 2023-01-20 /pmc/articles/PMC9898833/ /pubmed/36662548 http://dx.doi.org/10.2196/38590 Text en ©Xiaojie Chen, Han Chen, Shan Nan, Xiangtian Kong, Huilong Duan, Haiyan Zhu. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 20.01.2023. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9898833/
https://www.ncbi.nlm.nih.gov/pubmed/36662548
http://dx.doi.org/10.2196/38590