A Resampling Method to Improve the Prognostic Model of End-Stage Kidney Disease: A Better Strategy for Imbalanced Data

Bibliographic Details
Main Authors: Shi, Xi, Qu, Tingyu, Van Pottelbergh, Gijs, van den Akker, Marjan, De Moor, Bart
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Medicine
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8935060/
https://www.ncbi.nlm.nih.gov/pubmed/35321465
http://dx.doi.org/10.3389/fmed.2022.730748
author Shi, Xi
Qu, Tingyu
Van Pottelbergh, Gijs
van den Akker, Marjan
De Moor, Bart
collection PubMed
description BACKGROUND: Prognostic models can help identify patients at risk for end-stage kidney disease (ESKD) at an earlier stage so that preventive medical interventions can be provided. Previous studies mostly applied the Cox proportional hazards model. The aim of this study is to present a resampling method that can deal with the imbalanced data structure for the prognostic model and help improve predictive performance.
METHODS: Electronic health records of patients with chronic kidney disease (CKD) older than 50 years during 2005–2015, collected from primary care in Belgium, were used (n = 11,645). Both the Cox proportional hazards model and logistic regression analysis were applied as reference models. Then the resampling method, the Synthetic Minority Over-Sampling Technique-Edited Nearest Neighbor (SMOTE-ENN), was applied as a preprocessing procedure, followed by logistic regression analysis. Performance was evaluated by accuracy, the area under the curve (AUC), the confusion matrix, and the F3 score.
RESULTS: The C statistic for the Cox proportional hazards model was 0.807, while the AUC for the logistic regression analysis was 0.700, both comparable to previous studies. With the model trained on the resampled set, 86.3% of patients with ESKD were correctly identified, although at the cost of a high misclassification rate for negative cases. The F3 score was 0.245, much higher than the 0.043 for the logistic regression analysis and the 0.022 for the Cox proportional hazards model.
CONCLUSION: This study pointed out the imbalanced data structure and its effects on prediction accuracy, which were not thoroughly discussed in previous studies. Using the resampling method, we were able to identify patients at high risk for ESKD better from a clinical perspective. However, the method is limited by its high misclassification rate for negative cases. The technique can be widely used in other clinical topics where an imbalanced data structure must be considered.
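The F score with index 3 reported in the description is read here as the standard F-beta score with beta = 3, which weights recall (correctly identified ESKD cases) above precision; this reading is an assumption, since the record does not define the metric. The sketch below computes F-beta from confusion-matrix counts. The function name `f_beta` and the counts are hypothetical illustrations and are not taken from the study.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """F-beta score from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    With beta > 1, recall counts for more than precision.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no positives at all, to avoid dividing by zero.
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts for an imbalanced test set (NOT the study's data):
# most true cases are found, but many negatives are flagged as positive.
tp, fp, fn = 86, 2500, 14

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"recall={recall:.3f} precision={precision:.3f}")
print(f"F1={f_beta(tp, fp, fn, beta=1.0):.3f} F3={f_beta(tp, fp, fn, beta=3.0):.3f}")
```

With counts like these, where recall is high and precision is low, F3 comes out higher than F1, which is why a recall-weighted score can favor a model that finds most positive cases despite many false alarms.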
format Online
Article
Text
id pubmed-8935060
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8935060 2022-03-22 Front Med (Lausanne) Medicine Frontiers Media S.A. 2022-03-07 /pmc/articles/PMC8935060/ /pubmed/35321465 http://dx.doi.org/10.3389/fmed.2022.730748 Text en
Copyright © 2022 Shi, Qu, Van Pottelbergh, van den Akker and De Moor. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title A Resampling Method to Improve the Prognostic Model of End-Stage Kidney Disease: A Better Strategy for Imbalanced Data
topic Medicine
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8935060/
https://www.ncbi.nlm.nih.gov/pubmed/35321465
http://dx.doi.org/10.3389/fmed.2022.730748