
An explainable artificial intelligence framework for risk prediction of COPD in smokers


Bibliographic Details
Main Authors: Wang, Xuchun, Qiao, Yuchao, Cui, Yu, Ren, Hao, Zhao, Ying, Linghu, Liqin, Ren, Jiahui, Zhao, Zhiyang, Chen, Limin, Qiu, Lixia
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10626705/
https://www.ncbi.nlm.nih.gov/pubmed/37932692
http://dx.doi.org/10.1186/s12889-023-17011-w
_version_ 1785131392513867776
author Wang, Xuchun
Qiao, Yuchao
Cui, Yu
Ren, Hao
Zhao, Ying
Linghu, Liqin
Ren, Jiahui
Zhao, Zhiyang
Chen, Limin
Qiu, Lixia
author_sort Wang, Xuchun
collection PubMed
description BACKGROUND: Because the early signs of Chronic Obstructive Pulmonary Disease (COPD) are inconspicuous, affected individuals often go unidentified, and opportunities for timely prevention and treatment are missed. The purpose of this study was to create an explainable artificial intelligence framework that combines data preprocessing, machine learning, and model interpretability methods to identify people at high risk of COPD in the smoking population and to provide a reasonable interpretation of the model's predictions.
METHODS: The data comprised questionnaire information, physical examination data, and results of pulmonary function tests before and after bronchodilatation. First, factorial analysis for mixed data (FAMD), Boruta, and NRSBoundary-SMOTE resampling were used to address missing data, high dimensionality, and class imbalance. Then, seven classification models (CatBoost, NGBoost, XGBoost, LightGBM, random forest, SVM, and logistic regression) were applied to model the risk level, and the decisions of the best machine learning (ML) model were explained using the Shapley additive explanations (SHAP) method and partial dependence plots (PDP).
RESULTS: In the smoking population, age and 14 other variables were significant factors for predicting COPD. The CatBoost, random forest, and logistic regression models performed reasonably well on unbalanced datasets. CatBoost with NRSBoundary-SMOTE had the best classification performance on balanced datasets when composite indicators (the AUC, F1-score, and G-mean) were used as model comparison criteria. Age, COPD Assessment Test (CAT) score, gross annual income, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), anhelation, respiratory disease, central obesity, use of polluting fuel for household heating, region, use of polluting fuel for household cooking, and wheezing were important factors for predicting COPD in the smoking population.
CONCLUSION: This study combined feature screening methods, unbalanced data processing methods, and advanced machine learning methods to enable early identification of COPD risk groups in the smoking population. COPD risk factors in the smoking population were identified using SHAP and PDP, with the goal of providing theoretical support for targeted screening strategies and self-management strategies for the smoking population.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12889-023-17011-w.
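The abstract compares models by composite indicators, including the F1-score and G-mean. The following is a minimal sketch of how those two indicators can be computed from binary confusion-matrix counts, assuming the common definition of G-mean as the geometric mean of sensitivity and specificity; the record does not state the exact definition the authors used. The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1-score, G-mean) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Assumed definition: geometric mean of sensitivity and specificity.
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical counts for an imbalanced sample of 100 (10 positives).
f1, gmean = f1_and_gmean(tp=8, fp=2, fn=2, tn=88)

# A classifier that always predicts the majority (negative) class reaches
# 90% accuracy on the same sample, yet both composite indicators are zero.
f1_majority, gmean_majority = f1_and_gmean(tp=0, fp=0, fn=10, tn=90)
```

Under these assumptions, both indicators drop to zero for a model that ignores the minority class, which is why they are often preferred over plain accuracy on unbalanced data.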
format Online
Article
Text
id pubmed-10626705
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10626705 2023-11-07 An explainable artificial intelligence framework for risk prediction of COPD in smokers BMC Public Health Research BioMed Central 2023-11-06 /pmc/articles/PMC10626705/ /pubmed/37932692 http://dx.doi.org/10.1186/s12889-023-17011-w Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title An explainable artificial intelligence framework for risk prediction of COPD in smokers
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10626705/
https://www.ncbi.nlm.nih.gov/pubmed/37932692
http://dx.doi.org/10.1186/s12889-023-17011-w