Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest

Nonalcoholic fatty liver disease (NAFLD) is the hepatic manifestation of metabolic syndrome and is the most common cause of chronic liver disease in developed countries. Certain conditions, including mild inflammation biomarkers, dyslipidemia, and insulin resistance, can trigger a progression to non...

Bibliographic Details
Main Authors: García-Carretero, Rafael; Holgado-Cuadrado, Roberto; Barquero-Pérez, Óscar
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8234908/
https://www.ncbi.nlm.nih.gov/pubmed/34204225
http://dx.doi.org/10.3390/e23060763
_version_ 1783714192159670272
author García-Carretero, Rafael
Holgado-Cuadrado, Roberto
Barquero-Pérez, Óscar
author_facet García-Carretero, Rafael
Holgado-Cuadrado, Roberto
Barquero-Pérez, Óscar
author_sort García-Carretero, Rafael
collection PubMed
description Nonalcoholic fatty liver disease (NAFLD) is the hepatic manifestation of metabolic syndrome and is the most common cause of chronic liver disease in developed countries. Certain conditions, including mild inflammation biomarkers, dyslipidemia, and insulin resistance, can trigger a progression to nonalcoholic steatohepatitis (NASH), a condition characterized by inflammation and liver cell damage. We demonstrate the usefulness of machine learning with a case study to analyze the most important features in random forest (RF) models for predicting patients at risk of developing NASH. We collected data from patients who attended the Cardiovascular Risk Unit of Mostoles University Hospital (Madrid, Spain) from 2005 to 2021. We reviewed electronic health records to assess the presence of NASH, which was used as the outcome. We chose RF as the algorithm to develop six models using different pre-processing strategies. Performance metrics were evaluated to choose an optimized model. Finally, several interpretability techniques, such as feature importance, contribution of each feature to predictions, and partial dependence plots, were used to explain the model and to better understand machine learning-based predictions. In total, 1525 patients met the inclusion criteria. The mean age was 57.3 years, and 507 patients had NASH (prevalence of 33.2%). Filter methods (the chi-square and Mann–Whitney–Wilcoxon tests) did not produce additional insight in terms of interactions, contributions, or relationships among variables and their outcomes. The random forest model correctly classified patients with NASH with an accuracy of 0.87 in the best model and 0.79 in the worst one. Four features were the most relevant: insulin resistance, ferritin, serum levels of insulin, and triglycerides. The contribution of each feature was assessed via partial dependence plots. Random forest-based modeling demonstrated that machine learning can be used to improve interpretability, produce understanding of the modeled behavior, and demonstrate how far certain features can contribute to predictions.
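The description mentions partial dependence plots as one of the interpretability techniques. As a rough illustration of the idea only, the sketch below computes a partial dependence curve in plain Python: one feature is forced to each value on a grid, and the model's predictions are averaged over all records. Everything here is an assumption made for illustration: the `records` values are invented, `toy_model` is a hand-written score standing in for a fitted classifier (it is not the random forest from the article), and the function and variable names are hypothetical.

```python
from statistics import mean

# Hypothetical toy records: (insulin_resistance_index, ferritin, triglycerides).
# These values are invented for illustration and are NOT data from the study.
records = [
    (1.2, 80.0, 110.0),
    (2.8, 150.0, 160.0),
    (3.5, 320.0, 210.0),
    (4.1, 90.0, 140.0),
    (5.0, 400.0, 260.0),
]


def toy_model(row):
    """Stand-in for a fitted classifier's predicted probability of NASH.

    A hand-written monotone score capped at 1.0; an assumption for this
    sketch, not the model described in the article.
    """
    ir, ferritin, tg = row
    score = 0.10 * ir + 0.001 * ferritin + 0.001 * tg
    return min(1.0, score)


def partial_dependence(model, data, feature_index, grid):
    """Return the mean prediction with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in data:
            modified = list(row)
            modified[feature_index] = value  # overwrite only the feature of interest
            preds.append(model(tuple(modified)))
        curve.append(mean(preds))
    return curve


# Partial dependence of the toy prediction on the first feature.
grid = [1.0, 2.0, 3.0, 4.0, 5.0]
pd_curve = partial_dependence(toy_model, records, 0, grid)
for value, avg in zip(grid, pd_curve):
    print(f"insulin resistance = {value:.1f} -> mean predicted risk {avg:.3f}")
```

With a real fitted model, the same averaging would be applied to the model's predicted probabilities; plotting the grid against the resulting curve gives the partial dependence plot for that feature.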
format Online
Article
Text
id pubmed-8234908
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8234908 2021-06-27 Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest García-Carretero, Rafael Holgado-Cuadrado, Roberto Barquero-Pérez, Óscar Entropy (Basel) Article Nonalcoholic fatty liver disease (NAFLD) is the hepatic manifestation of metabolic syndrome and is the most common cause of chronic liver disease in developed countries. Certain conditions, including mild inflammation biomarkers, dyslipidemia, and insulin resistance, can trigger a progression to nonalcoholic steatohepatitis (NASH), a condition characterized by inflammation and liver cell damage. We demonstrate the usefulness of machine learning with a case study to analyze the most important features in random forest (RF) models for predicting patients at risk of developing NASH. We collected data from patients who attended the Cardiovascular Risk Unit of Mostoles University Hospital (Madrid, Spain) from 2005 to 2021. We reviewed electronic health records to assess the presence of NASH, which was used as the outcome. We chose RF as the algorithm to develop six models using different pre-processing strategies. Performance metrics were evaluated to choose an optimized model. Finally, several interpretability techniques, such as feature importance, contribution of each feature to predictions, and partial dependence plots, were used to explain the model and to better understand machine learning-based predictions. In total, 1525 patients met the inclusion criteria. The mean age was 57.3 years, and 507 patients had NASH (prevalence of 33.2%). Filter methods (the chi-square and Mann–Whitney–Wilcoxon tests) did not produce additional insight in terms of interactions, contributions, or relationships among variables and their outcomes. The random forest model correctly classified patients with NASH with an accuracy of 0.87 in the best model and 0.79 in the worst one. Four features were the most relevant: insulin resistance, ferritin, serum levels of insulin, and triglycerides. The contribution of each feature was assessed via partial dependence plots. Random forest-based modeling demonstrated that machine learning can be used to improve interpretability, produce understanding of the modeled behavior, and demonstrate how far certain features can contribute to predictions. MDPI 2021-06-17 /pmc/articles/PMC8234908/ /pubmed/34204225 http://dx.doi.org/10.3390/e23060763 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
García-Carretero, Rafael
Holgado-Cuadrado, Roberto
Barquero-Pérez, Óscar
Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest
title Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest
title_full Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest
title_fullStr Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest
title_full_unstemmed Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest
title_short Assessment of Classification Models and Relevant Features on Nonalcoholic Steatohepatitis Using Random Forest
title_sort assessment of classification models and relevant features on nonalcoholic steatohepatitis using random forest
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8234908/
https://www.ncbi.nlm.nih.gov/pubmed/34204225
http://dx.doi.org/10.3390/e23060763
work_keys_str_mv AT garciacarreterorafael assessmentofclassificationmodelsandrelevantfeaturesonnonalcoholicsteatohepatitisusingrandomforest
AT holgadocuadradoroberto assessmentofclassificationmodelsandrelevantfeaturesonnonalcoholicsteatohepatitisusingrandomforest
AT barqueroperezoscar assessmentofclassificationmodelsandrelevantfeaturesonnonalcoholicsteatohepatitisusingrandomforest