Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach
BACKGROUND: In emergency departments (EDs), early diagnosis and timely rescue, which are supported by prediction models using ED data, can increase patients’ chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early...
Main Authors: | Chen, Xiaojie, Chen, Han, Nan, Shan, Kong, Xiangtian, Duan, Huilong, Zhu, Haiyan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | JMIR Publications 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9898833/ https://www.ncbi.nlm.nih.gov/pubmed/36662548 http://dx.doi.org/10.2196/38590 |
_version_ | 1784882514782846976 |
---|---|
author | Chen, Xiaojie Chen, Han Nan, Shan Kong, Xiangtian Duan, Huilong Zhu, Haiyan |
author_facet | Chen, Xiaojie Chen, Han Nan, Shan Kong, Xiangtian Duan, Huilong Zhu, Haiyan |
author_sort | Chen, Xiaojie |
collection | PubMed |
description | BACKGROUND: In emergency departments (EDs), early diagnosis and timely rescue, which are supported by prediction models using ED data, can increase patients’ chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early identification models for diseases. OBJECTIVE: This study aims to propose a systematic approach to deal with the problems of missing, imbalanced, and sparse features for developing sudden-death prediction models using emergency medicine (or ED) data. METHODS: We proposed a 3-step approach to deal with data quality issues: a random forest (RF) for missing values, k-means for imbalanced data, and principal component analysis (PCA) for sparse features. For continuous and discrete variables, the coefficient of determination R² and the κ coefficient, respectively, were used to evaluate performance. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to estimate the model’s performance. To further evaluate the proposed approach, we carried out a case study using an ED data set obtained from the Hainan Hospital of Chinese PLA General Hospital. A logistic regression (LR) prediction model for patient condition worsening was built. RESULTS: A total of 1085 patients with rescue records and 17,959 patients without rescue records were selected; the two groups were significantly imbalanced. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnosis, respectively. After data preprocessing, the median R² of the RF continuous variable interpolation was 0.623 (IQR 0.647), and the median κ coefficient for discrete variable interpolation was 0.444 (IQR 0.285). The LR model constructed using the initial diagnostic data showed poor performance and variable separation, which was reflected in the abnormally high odds ratio (OR) values of the 2 variables cardiac arrest and respiratory arrest (201568034532 and 1211118945, respectively) and an abnormal 95% CI. Using processed data, the recall of the model reached 0.746, the F1-score was 0.73, and the AUROC was 0.708. CONCLUSIONS: The proposed systematic approach is valid for building a prediction model for emergency patients. |
format | Online Article Text |
id | pubmed-9898833 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | JMIR Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-9898833 2023-02-05 Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach Chen, Xiaojie Chen, Han Nan, Shan Kong, Xiangtian Duan, Huilong Zhu, Haiyan JMIR Med Inform Original Paper JMIR Publications 2023-01-20 /pmc/articles/PMC9898833/ /pubmed/36662548 http://dx.doi.org/10.2196/38590 Text en ©Xiaojie Chen, Han Chen, Shan Nan, Xiangtian Kong, Huilong Duan, Haiyan Zhu. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 20.01.2023. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included. |
spellingShingle | Original Paper Chen, Xiaojie Chen, Han Nan, Shan Kong, Xiangtian Duan, Huilong Zhu, Haiyan Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach |
title | Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9898833/ https://www.ncbi.nlm.nih.gov/pubmed/36662548 http://dx.doi.org/10.2196/38590 |
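
The description above states that imputation quality was evaluated with R² for continuous variables and the κ coefficient for discrete variables. The following is a minimal sketch of those two metrics in plain Python, not the authors' code. The function names, the idea of comparing imputed values against held-out true values, and the toy data are assumptions made for illustration, and κ is taken here to be Cohen's κ, which the record does not state explicitly.

```python
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all true values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cohen_kappa(labels_true, labels_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(labels_true)
    observed = sum(t == p for t, p in zip(labels_true, labels_pred)) / n
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    # Chance agreement: sum over categories of the product of marginal rates.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical held-out values (masked before imputation) and their imputations.
continuous_true = [1.0, 2.0, 3.0, 4.0]
continuous_imputed = [1.0, 2.0, 3.0, 5.0]
discrete_true = ["a", "a", "b", "b"]
discrete_imputed = ["a", "a", "b", "a"]

print(r_squared(continuous_true, continuous_imputed))
print(cohen_kappa(discrete_true, discrete_imputed))
```

In this toy case the continuous imputation misses one value by 1 unit, and the discrete imputation agrees on 3 of 4 labels against a chance agreement of 0.5. Computing one score per variable and then summarizing with the median and IQR would match the way the results are reported in the description.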