Cardiovascular disease incidence prediction by machine learning and statistical techniques: a 16-year cohort study from eastern Mediterranean region

BACKGROUND: Cardiovascular diseases (CVD) are the predominant cause of early death worldwide. Identification of people with a high risk of being affected by CVD is consequential in CVD prevention. This study adopts Machine Learning (ML) and statistical techniques to develop classification models for...

Bibliographic Details
Main Authors:	Mehrabani-Zeinabad, Kamran; Feizi, Awat; Sadeghi, Masoumeh; Roohafza, Hamidreza; Talaei, Mohammad; Sarrafzadegan, Nizal
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2023
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10116769/ https://www.ncbi.nlm.nih.gov/pubmed/37076833 http://dx.doi.org/10.1186/s12911-023-02169-5

_version_	1785028496133718016
author	Mehrabani-Zeinabad, Kamran; Feizi, Awat; Sadeghi, Masoumeh; Roohafza, Hamidreza; Talaei, Mohammad; Sarrafzadegan, Nizal
author_facet	Mehrabani-Zeinabad, Kamran; Feizi, Awat; Sadeghi, Masoumeh; Roohafza, Hamidreza; Talaei, Mohammad; Sarrafzadegan, Nizal
author_sort	Mehrabani-Zeinabad, Kamran
collection	PubMed
description	BACKGROUND: Cardiovascular diseases (CVD) are the predominant cause of early death worldwide. Identification of people with a high risk of being affected by CVD is consequential in CVD prevention. This study adopts Machine Learning (ML) and statistical techniques to develop classification models for predicting the future occurrence of CVD events in a large sample of Iranians. METHODS: We used multiple prediction models and ML techniques with different abilities to analyze the large dataset of 5432 healthy people at the beginning of entrance into the Isfahan Cohort Study (ICS) (1990–2017). Bayesian additive regression trees enhanced with “missingness incorporated in attributes” (BARTm) was run on the dataset with 515 variables (336 variables without and the remaining with up to 90% missing values). In the other classification algorithms, variables with more than 10% missing values were excluded, and MissForest imputed the missing values of the remaining 49 variables. We used Recursive Feature Elimination (RFE) to select the most contributing variables. The random oversampling technique, the cut-point recommended by the precision-recall curve, and relevant evaluation metrics were used to handle imbalance in the binary response variable. RESULTS: This study revealed that age, systolic blood pressure, fasting blood sugar, two-hour postprandial glucose, diabetes mellitus, history of heart disease, history of high blood pressure, and history of diabetes are the most contributing factors for predicting CVD incidence in the future. The main differences between the results of classification algorithms are due to the trade-off between sensitivity and specificity. The Quadratic Discriminant Analysis (QDA) algorithm presents the highest accuracy (75.50 ± 0.08) but the minimum sensitivity (49.84 ± 0.25); in contrast, decision trees provide the lowest accuracy (51.95 ± 0.69) but the top sensitivity (82.52 ± 1.22). BARTm.90% resulted in 69.48 ± 0.28 accuracy and 54.00 ± 1.66 sensitivity without any preprocessing step. CONCLUSIONS: This study confirmed that building a prediction model for CVD in each region is valuable for screening and primary prevention strategies in that specific region. Also, results showed that using conventional statistical models alongside ML algorithms makes it possible to take advantage of both techniques. Generally, QDA can accurately predict the future occurrence of CVD events with a fast (inference speed) and stable (confidence values) procedure. The combined ML and statistical algorithm of BARTm provides a flexible approach without any need for technical knowledge about assumptions and preprocessing steps of the prediction procedure.
format	Online Article Text
id	pubmed-10116769
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-10116769 2023-04-21 BMC Med Inform Decis Mak Research BioMed Central 2023-04-19 /pmc/articles/PMC10116769/ /pubmed/37076833 http://dx.doi.org/10.1186/s12911-023-02169-5 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Cardiovascular disease incidence prediction by machine learning and statistical techniques: a 16-year cohort study from eastern Mediterranean region
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10116769/ https://www.ncbi.nlm.nih.gov/pubmed/37076833 http://dx.doi.org/10.1186/s12911-023-02169-5
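
The METHODS part of the abstract names random oversampling and a cut-point taken from the precision-recall curve as the ways class imbalance was handled, and the RESULTS are reported as a trade-off between sensitivity and specificity. The following is a minimal sketch of those two ideas in plain Python, not the authors' code. The function names, the toy labels and scores, and the choice of the highest-F1 point as the "recommended" cut-point are all assumptions made for illustration; the record does not say which criterion the study used.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced.

    Assumes binary labels (0/1) and at least one row in each class.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = pos + neg + extra
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


def confusion(y_true, scores, threshold):
    """Count TP, FP, TN, FN when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def pr_cut_point(y_true, scores):
    """Scan the precision-recall curve and return (threshold, F1) at the best F1.

    Using F1 as the selection rule is an assumption of this sketch.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, _, fn = confusion(y_true, scores, t)
        if tp == 0:
            continue  # precision and F1 are undefined or zero here
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) at the given threshold."""
    tp, fp, tn, fn = confusion(y_true, scores, threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: 6 non-events, 2 events, with made-up model scores.
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]
scores_toy = [0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.55, 0.80]
X_toy = [[s] for s in scores_toy]

X_bal, y_bal = random_oversample(X_toy, y_toy)
cut, f1 = pr_cut_point(y_toy, scores_toy)
sens, spec = sensitivity_specificity(y_toy, scores_toy, cut)
print(len(y_bal), sum(y_bal), cut, round(f1, 3), round(sens, 3), round(spec, 3))
```

In practice oversampling would be applied only to the training split, and the cut-point would be chosen on held-out scores, so that the reported sensitivity and specificity are not inflated by duplicated rows.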
