
Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES

The prevalence of cardiocerebrovascular disease (CVD) is continuously increasing, and it is the leading cause of human death. Since it is difficult for physicians to screen thousands of people, high-accuracy and interpretable methods need to be presented. We developed four machine learning-based CVD...


Bibliographic Details
Main authors: Oh, Taeseob, Kim, Dongkyun, Lee, Siryeol, Won, Changwon, Kim, Sunyoung, Yang, Ji-soo, Yu, Junghwa, Kim, Byungsung, Lee, Joohyun
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8831514/
https://www.ncbi.nlm.nih.gov/pubmed/35145205
http://dx.doi.org/10.1038/s41598-022-06333-1
_version_ 1784648521118384128
author Oh, Taeseob
Kim, Dongkyun
Lee, Siryeol
Won, Changwon
Kim, Sunyoung
Yang, Ji-soo
Yu, Junghwa
Kim, Byungsung
Lee, Joohyun
author_facet Oh, Taeseob
Kim, Dongkyun
Lee, Siryeol
Won, Changwon
Kim, Sunyoung
Yang, Ji-soo
Yu, Junghwa
Kim, Byungsung
Lee, Joohyun
author_sort Oh, Taeseob
collection PubMed
description The prevalence of cardiocerebrovascular disease (CVD) is continuously increasing, and it is the leading cause of human death. Since it is difficult for physicians to screen thousands of people, high-accuracy and interpretable methods need to be presented. We developed four machine learning-based CVD classifiers (i.e., multi-layer perceptron, support vector machine, random forest, and light gradient boosting) based on the Korea National Health and Nutrition Examination Survey (KNHANES). We resampled and rebalanced KNHANES data using complex sampling weights such that the rebalanced dataset mimics a uniformly sampled dataset from the overall population. For clear risk factor analysis, we removed multicollinear and CVD-irrelevant variables using VIF-based filtering and the Boruta algorithm. We applied the synthetic minority oversampling technique and random undersampling before ML training. We demonstrated that the proposed classifiers achieved excellent performance with AUCs over 0.853. Using Shapley value-based risk factor analysis, we identified that the most significant risk factors of CVD were age, sex, and the prevalence of hypertension. Additionally, we identified that age, hypertension, and BMI were positively correlated with CVD prevalence, while sex (female), alcohol consumption, and monthly income were negatively correlated. The results showed that the feature selection and the class balancing technique effectively improve the interpretability of models.
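The resampling and class-balancing steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the function names, the toy records, and the simplified nearest-neighbour interpolation (a SMOTE-like stand-in rather than a full SMOTE implementation) are all assumptions made for illustration.

```python
import random


def resample_by_weight(records, weights, k, seed=0):
    """Draw k records with replacement, with probability proportional to
    each record's survey sampling weight (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.choices(records, weights=weights, k=k)


def smote_like(minority, n_new, k_neighbors=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its nearest minority
    neighbours. Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Rank the other minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k_neighbors])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority-class points."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Toy data (made up): feature tuples for the minority and majority classes.
minority = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
majority = [(float(i), float(-i)) for i in range(10)]

# Oversample the minority class and undersample the majority class
# until both classes have six points.
balanced_minority = minority + smote_like(minority, n_new=3)
balanced_majority = random_undersample(majority, n_keep=6)
```

In practice a maintained library would normally be preferred over hand-written helpers like these; the sketch only shows the idea that weighted draws approximate a uniform population sample, and that interpolation plus subsampling evens out class counts before training.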
format Online
Article
Text
id pubmed-8831514
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-88315142022-02-14 Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES Oh, Taeseob Kim, Dongkyun Lee, Siryeol Won, Changwon Kim, Sunyoung Yang, Ji-soo Yu, Junghwa Kim, Byungsung Lee, Joohyun Sci Rep Article The prevalence of cardiocerebrovascular disease (CVD) is continuously increasing, and it is the leading cause of human death. Since it is difficult for physicians to screen thousands of people, high-accuracy and interpretable methods need to be presented. We developed four machine learning-based CVD classifiers (i.e., multi-layer perceptron, support vector machine, random forest, and light gradient boosting) based on the Korea National Health and Nutrition Examination Survey (KNHANES). We resampled and rebalanced KNHANES data using complex sampling weights such that the rebalanced dataset mimics a uniformly sampled dataset from the overall population. For clear risk factor analysis, we removed multicollinear and CVD-irrelevant variables using VIF-based filtering and the Boruta algorithm. We applied the synthetic minority oversampling technique and random undersampling before ML training. We demonstrated that the proposed classifiers achieved excellent performance with AUCs over 0.853. Using Shapley value-based risk factor analysis, we identified that the most significant risk factors of CVD were age, sex, and the prevalence of hypertension. Additionally, we identified that age, hypertension, and BMI were positively correlated with CVD prevalence, while sex (female), alcohol consumption, and monthly income were negatively correlated. The results showed that the feature selection and the class balancing technique effectively improve the interpretability of models.
Nature Publishing Group UK 2022-02-10 /pmc/articles/PMC8831514/ /pubmed/35145205 http://dx.doi.org/10.1038/s41598-022-06333-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Article
Oh, Taeseob
Kim, Dongkyun
Lee, Siryeol
Won, Changwon
Kim, Sunyoung
Yang, Ji-soo
Yu, Junghwa
Kim, Byungsung
Lee, Joohyun
Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES
title Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES
title_full Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES
title_fullStr Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES
title_full_unstemmed Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES
title_short Machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on KNHANES
title_sort machine learning-based diagnosis and risk factor analysis of cardiocerebrovascular disease based on knhanes
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8831514/
https://www.ncbi.nlm.nih.gov/pubmed/35145205
http://dx.doi.org/10.1038/s41598-022-06333-1