A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction
BACKGROUND: The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge to data analysis in support of clinical decision-making. The objective of this study i...
Main authors: | Hu, Zhiyong; Du, Dongping |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7505424/ https://www.ncbi.nlm.nih.gov/pubmed/32956366 http://dx.doi.org/10.1371/journal.pone.0237724 |
_version_ | 1783584808869298176 |
---|---|
author | Hu, Zhiyong Du, Dongping |
author_facet | Hu, Zhiyong Du, Dongping |
author_sort | Hu, Zhiyong |
collection | PubMed |
description | BACKGROUND: The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge to data analysis in support of clinical decision-making. The objective of this study is to develop a new methodological framework that can address the missing data challenge and provide a reliable tool to predict hospital readmission among heart failure patients. METHODS: We used a Gaussian Process Latent Variable Model (GPLVM) to impute the missing values. Specifically, a lower-dimensional embedding was learned from a small complete dataset and then used to impute the missing values in the incomplete dataset. The GPLVM-based imputation provides both a mean estimate and the uncertainty associated with that estimate. To incorporate this uncertainty in prediction, a constrained support vector machine (cSVM) was developed to obtain robust predictions. We first sampled multiple datasets from the distributions of input uncertainty and trained a support vector machine on each dataset. An optimal classifier was then identified by selecting the support vectors that maximize the separation margin of a newly sampled dataset and minimize the similarity with the pre-trained support vectors. RESULTS: The proposed model was derived and validated using the PhysioNet MIMIC-III clinical database. The GPLVM imputation yielded normalized mean absolute errors of 0.11 and 0.12 when 20% and 30% of instances, respectively, contained missing values, and the confidence bounds of the estimates captured 97% of the true values. The cSVM model achieved an average area under the curve of 0.68, which improves the prediction accuracy by 7% compared with some existing classifiers. CONCLUSIONS: The proposed method provides accurate imputation of missing values and better prediction performance than existing models that can only deal with deterministic inputs. |
format | Online Article Text |
id | pubmed-7505424 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-7505424 2020-09-30 A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction Hu, Zhiyong Du, Dongping PLoS One Research Article Public Library of Science 2020-09-21 /pmc/articles/PMC7505424/ /pubmed/32956366 http://dx.doi.org/10.1371/journal.pone.0237724 Text en © 2020 Hu, Du http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Hu, Zhiyong Du, Dongping A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction |
title | A new analytical framework for missing data imputation and classification with uncertainty: Missing data imputation and heart failure readmission prediction |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7505424/ https://www.ncbi.nlm.nih.gov/pubmed/32956366 http://dx.doi.org/10.1371/journal.pone.0237724 |
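
The description above reports a normalized mean absolute error for the imputed values, the share of true values captured by the confidence bounds, and a step that samples multiple datasets from the imputation uncertainty. The following is a minimal Python sketch of those three computations, not the authors' implementation. It assumes that the error is normalized by the range of the true values, that the confidence bounds are mean ± 1.96 standard deviations, and that each imputed entry is sampled from an independent Gaussian; the record states none of these details. All function names and the toy numbers are hypothetical, and the GPLVM and cSVM models themselves are not reproduced here.

```python
import random


def normalized_mae(true_vals, imputed_means):
    """Mean absolute error divided by the range of the true values.

    The range-based normalization is an assumption; the record does not
    say how the reported error was normalized.
    """
    span = max(true_vals) - min(true_vals)
    if span == 0:
        raise ValueError("true values have zero range; cannot normalize")
    abs_errors = [abs(t - m) for t, m in zip(true_vals, imputed_means)]
    return sum(abs_errors) / len(abs_errors) / span


def interval_coverage(true_vals, imputed_means, imputed_stds, z=1.96):
    """Fraction of true values inside mean +/- z * std (assumed bounds)."""
    hits = sum(
        1
        for t, m, s in zip(true_vals, imputed_means, imputed_stds)
        if m - z * s <= t <= m + z * s
    )
    return hits / len(true_vals)


def sample_datasets(imputed_means, imputed_stds, n_datasets, seed=0):
    """Draw n_datasets versions of the imputed entries.

    Each entry is sampled from an independent Gaussian with the imputed
    mean and standard deviation (an assumed uncertainty model).
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(m, s) for m, s in zip(imputed_means, imputed_stds)]
        for _ in range(n_datasets)
    ]


# Toy values, made up for illustration only.
true_vals = [1.0, 2.0, 3.0, 5.0]
imputed_means = [1.5, 2.0, 2.5, 5.0]
imputed_stds = [0.5, 0.1, 0.1, 0.2]

nmae = normalized_mae(true_vals, imputed_means)
coverage = interval_coverage(true_vals, imputed_means, imputed_stds)
datasets = sample_datasets(imputed_means, imputed_stds, n_datasets=5)
```

In a pipeline like the one described, each sampled dataset would then be used to train one support vector machine before the final classifier is selected; that selection step depends on details not given in this record.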