Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage

BACKGROUND: Outliers and class imbalance in medical data could affect the accuracy of machine learning models. For physicians who want to apply predictive models, how to use the data at hand to build a model and what model to choose are very thorny problems. Therefore, it is necessary to consider ou...

Bibliographic Details
Main authors: Tang, Jianxiang, Wang, Xiaoyu, Wan, Hongli, Lin, Chunying, Shao, Zilun, Chang, Yang, Wang, Hexuan, Wu, Yi, Zhang, Tao, Du, Yu
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9594939/
https://www.ncbi.nlm.nih.gov/pubmed/36284327
http://dx.doi.org/10.1186/s12911-022-02018-x
author Tang, Jianxiang
Wang, Xiaoyu
Wan, Hongli
Lin, Chunying
Shao, Zilun
Chang, Yang
Wang, Hexuan
Wu, Yi
Zhang, Tao
Du, Yu
collection PubMed
description BACKGROUND: Outliers and class imbalance in medical data could affect the accuracy of machine learning models. For physicians who want to apply predictive models, how to use the data at hand to build a model and what model to choose are very thorny problems. Therefore, it is necessary to consider outliers, imbalanced data, model selection, and parameter tuning when modeling.
METHODS: This study used a joint modeling strategy consisting of: outlier detection and removal, data balancing, model fitting and prediction, performance evaluation. We collected medical record data for all ICH patients with admissions in 2017–2019 from Sichuan Province. Clinical and radiological variables were used to construct models to predict mortality outcomes 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy, including training set with and without cross-validated committees filter (CVCF), five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), Borderline synthetic minority oversampling technique (Borderline SMOTE), synthetic minority oversampling technique and edited nearest neighbor (SMOTEENN)) and no resampling, seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost).
RESULTS: Among 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge, and 1298 (30.85%) died within 90 days after discharge. The performance of all models improved with removing outliers by CVCF except sensitivity. For data balancing processing, the performance of training set without resampling was better than that of training set with resampling in terms of accuracy, specificity, and precision. And the AUC of ROS was the best. For seven models, the average accuracy, specificity, AUC, and precision of RF were the highest. Stacking performed best in F1 score. Among all 84 combinations of joint modeling strategy, eight combinations performed best in terms of accuracy (0.816). For sensitivity, the best performance was SMOTEENN + Stacking (0.662). For specificity, the best performance was CVCF + KNN (0.987). Stacking and AdaBoost had the best performances in AUC (0.756) and F1 score (0.602), respectively. For precision, the best performance was CVCF + SVM (0.938).
CONCLUSION: This study proposed a joint modeling strategy including outlier detection and removal, data balancing, model fitting and prediction, performance evaluation, in order to provide a reference for physicians and researchers who want to build their own models. This study illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning might be a good modeling strategy. Due to the low imbalanced ratio (IR, the ratio of majority class and minority class) in this study, we did not find any improvement in models with resampling in terms of accuracy, specificity, and precision, while ROS performed best on AUC.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-022-02018-x.
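The arithmetic behind the abstract can be checked with a short sketch. The Python below is illustrative only and is not the authors' code: the function and constant names are hypothetical, and the confusion-matrix counts in the usage note are made up. It enumerates the strategy grid (2 outlier-filter settings × 6 resampling settings × 7 models), computes the imbalance ratio from the reported class counts, and derives the threshold-based metrics named in the abstract from confusion-matrix counts. AUC is left out because it needs predicted scores rather than counts.

```python
from itertools import product

# Options named in the abstract; the identifiers are illustrative labels only.
OUTLIER_FILTERS = ["none", "CVCF"]
RESAMPLERS = ["none", "RUS", "ROS", "ADASYN", "BorderlineSMOTE", "SMOTEENN"]
MODELS = ["LR", "RF", "ANN", "SVM", "KNN", "Stacking", "AdaBoost"]


def strategy_grid():
    """Return every (outlier filter, resampler, model) combination."""
    return list(product(OUTLIER_FILTERS, RESAMPLERS, MODELS))


def imbalance_ratio(n_majority, n_minority):
    """Imbalance ratio (IR): majority class size divided by minority class size."""
    return n_majority / n_minority


def binary_metrics(tp, fp, tn, fn):
    """Metrics from confusion-matrix counts (assumes no denominator is zero)."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    print(len(strategy_grid()))                   # 84 combinations
    print(round(imbalance_ratio(2909, 1298), 2))  # 2.24 (survived vs. died)
```

With the hypothetical counts `binary_metrics(tp=40, fp=10, tn=45, fn=5)`, accuracy is 0.85 and precision is 0.8. An IR of about 2.24 is consistent with the abstract's description of the imbalance as low.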
format Online
Article
Text
id pubmed-9594939
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-9594939 2022-10-26 Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage Tang, Jianxiang; Wang, Xiaoyu; Wan, Hongli; Lin, Chunying; Shao, Zilun; Chang, Yang; Wang, Hexuan; Wu, Yi; Zhang, Tao; Du, Yu BMC Med Inform Decis Mak Research
BioMed Central 2022-10-25 /pmc/articles/PMC9594939/ /pubmed/36284327 http://dx.doi.org/10.1186/s12911-022-02018-x Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage
topic Research