
Predicting disease risks from highly imbalanced data using random forest

BACKGROUND: We present a method utilizing Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health co...


Bibliographic Details
Main Authors:	Khalilia, Mohammed, Chakraborty, Sounak, Popescu, Mihail
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2011
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/ https://www.ncbi.nlm.nih.gov/pubmed/21801360 http://dx.doi.org/10.1186/1472-6947-11-51

_version_	1782210921819537408
author	Khalilia, Mohammed Chakraborty, Sounak Popescu, Mihail
author_facet	Khalilia, Mohammed Chakraborty, Sounak Popescu, Mihail
author_sort	Khalilia, Mohammed
collection	PubMed
description	BACKGROUND: We present a method utilizing Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare. METHODS: We employed the National Inpatient Sample (NIS) data, which is publicly available through Healthcare Cost and Utilization Project (HCUP), to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases. RESULTS: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. CONCLUSIONS: In combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
format	Online Article Text
id	pubmed-3163175
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-31631752011-08-29 Predicting disease risks from highly imbalanced data using random forest Khalilia, Mohammed Chakraborty, Sounak Popescu, Mihail BMC Med Inform Decis Mak Research Article BACKGROUND: We present a method utilizing Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare. METHODS: We employed the National Inpatient Sample (NIS) data, which is publicly available through Healthcare Cost and Utilization Project (HCUP), to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases. RESULTS: We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. CONCLUSIONS: In combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%. BioMed Central 2011-07-29 /pmc/articles/PMC3163175/ /pubmed/21801360 http://dx.doi.org/10.1186/1472-6947-11-51 Text en Copyright ©2011 Khalilia et al; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Khalilia, Mohammed Chakraborty, Sounak Popescu, Mihail Predicting disease risks from highly imbalanced data using random forest
title	Predicting disease risks from highly imbalanced data using random forest
title_full	Predicting disease risks from highly imbalanced data using random forest
title_fullStr	Predicting disease risks from highly imbalanced data using random forest
title_full_unstemmed	Predicting disease risks from highly imbalanced data using random forest
title_short	Predicting disease risks from highly imbalanced data using random forest
title_sort	predicting disease risks from highly imbalanced data using random forest
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/ https://www.ncbi.nlm.nih.gov/pubmed/21801360 http://dx.doi.org/10.1186/1472-6947-11-51
work_keys_str_mv	AT khaliliamohammed predictingdiseaserisksfromhighlyimbalanceddatausingrandomforest AT chakrabortysounak predictingdiseaserisksfromhighlyimbalanceddatausingrandomforest AT popescumihail predictingdiseaserisksfromhighlyimbalanceddatausingrandomforest
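
The description field above says the training data is divided into multiple sub-samples, each fully balanced, and that an ensemble is built on top of them. A minimal sketch of that idea follows, using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names `balanced_subsamples` and `majority_vote` are hypothetical, the draw scheme (all minority cases plus an equal-sized random draw from the majority class, without replacement within a sub-sample) is one plausible reading of "repeated random sub-sampling", and combining models by majority vote is likewise an assumption. Training the random forest on each sub-sample is left out.

```python
import random
from collections import Counter


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return a list of index lists, one per sub-sample.

    Assumption for this sketch: each sub-sample keeps every minority-class
    index and adds an equally sized random draw from the majority class,
    so both classes appear the same number of times.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch handles exactly two classes")
    # most_common orders classes from most to least frequent
    (majority, _), (minority, n_minority) = counts.most_common(2)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]

    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority_idx, n_minority)  # no repeats within a draw
        subsamples.append(sorted(minority_idx + drawn))
    return subsamples


def majority_vote(predictions):
    """Combine per-model predictions (one list per model) by majority vote.

    Ties go to the label seen first, so an odd number of models is safer.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]
```

In use, one classifier would be fitted on the rows selected by each index list, and the per-model predictions on new cases would be passed to `majority_vote`. Any classifier with a fit/predict interface could fill that role; the record names random forest as the method that performed best.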


Similar Items