A data-driven approach to predicting diabetes and cardiovascular disease with machine learning

BACKGROUND: Diabetes and cardiovascular disease are two of the main causes of death in the United States. Identifying and predicting these diseases in patients is the first step towards stopping their progression. We evaluate the capabilities of machine learning models in detecting at-risk patients...

Bibliographic Details
Main Authors:	Dinh, An; Miertschin, Stacey; Young, Amber; Mohanty, Somya D.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6836338/ https://www.ncbi.nlm.nih.gov/pubmed/31694707 http://dx.doi.org/10.1186/s12911-019-0918-5

_version_	1783466883530358784
author	Dinh, An; Miertschin, Stacey; Young, Amber; Mohanty, Somya D.
author_facet	Dinh, An; Miertschin, Stacey; Young, Amber; Mohanty, Somya D.
author_sort	Dinh, An
collection	PubMed
description	BACKGROUND: Diabetes and cardiovascular disease are two of the main causes of death in the United States. Identifying and predicting these diseases in patients is the first step towards stopping their progression. We evaluate the capabilities of machine learning models in detecting at-risk patients using survey data (and laboratory results), and identify key variables within the data contributing to these diseases among the patients. METHODS: Our research explores data-driven approaches which utilize supervised machine learning models to identify patients with such diseases. Using the National Health and Nutrition Examination Survey (NHANES) dataset, we conduct an exhaustive search of all available feature variables within the data to develop models for cardiovascular, prediabetes, and diabetes detection. Using different time-frames and feature sets for the data (based on laboratory data), multiple machine learning models (logistic regression, support vector machines, random forest, and gradient boosting) were evaluated on their classification performance. The models were then combined to develop a weighted ensemble model, capable of leveraging the performance of the disparate models to improve detection accuracy. Information gain of tree-based models was used to identify the key variables within the patient data that contributed to the detection of at-risk patients in each of the disease classes by the data-learned models. RESULTS: The developed ensemble model for cardiovascular disease (based on 131 variables) achieved an Area Under - Receiver Operating Characteristics (AU-ROC) score of 83.1% using no laboratory results, and 83.9% accuracy with laboratory results. In diabetes classification (based on 123 variables), eXtreme Gradient Boost (XGBoost) model achieved an AU-ROC score of 86.2% (without laboratory data) and 95.7% (with laboratory data). For pre-diabetic patients, the ensemble model had the top AU-ROC score of 73.7% (without laboratory data), and for laboratory-based data XGBoost performed the best at 84.4%. Top five predictors in diabetes patients were 1) waist size, 2) age, 3) self-reported weight, 4) leg length, and 5) sodium intake. For cardiovascular diseases the models identified 1) age, 2) systolic blood pressure, 3) self-reported weight, 4) occurrence of chest pain, and 5) diastolic blood pressure as key contributors. CONCLUSION: We conclude that machine-learned models based on survey questionnaires can provide an automated identification mechanism for patients at risk of diabetes and cardiovascular diseases. We also identify key contributors to the prediction, which can be further explored for their implications on electronic health records.
format	Online Article Text
id	pubmed-6836338
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6836338 2019-11-08 BMC Med Inform Decis Mak Research Article BioMed Central 2019-11-06 /pmc/articles/PMC6836338/ /pubmed/31694707 http://dx.doi.org/10.1186/s12911-019-0918-5 Text en © Dinh et al. 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A data-driven approach to predicting diabetes and cardiovascular disease with machine learning
title_sort	data-driven approach to predicting diabetes and cardiovascular disease with machine learning
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6836338/ https://www.ncbi.nlm.nih.gov/pubmed/31694707 http://dx.doi.org/10.1186/s12911-019-0918-5
