Utilizing logistic regression to compare risk factors in disease modeling with imbalanced data: a case study in vitamin D and cancer incidence

Imbalanced data, a common challenge encountered in statistical analyses of clinical trial datasets and disease modeling, refers to the scenario where one class significantly outnumbers the other in a binary classification problem. This imbalance can lead to biased model performance, favoring the maj...

Bibliographic Details
Main authors: Meysami, Mohammad; Kumar, Vijay; Pugh, McKayah; Lowery, Samuel Thomas; Sur, Shantanu; Mondal, Sumona; Greene, James M.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2023
Subjects: Oncology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10569817/
https://www.ncbi.nlm.nih.gov/pubmed/37841430
http://dx.doi.org/10.3389/fonc.2023.1227842
author Meysami, Mohammad
Kumar, Vijay
Pugh, McKayah
Lowery, Samuel Thomas
Sur, Shantanu
Mondal, Sumona
Greene, James M.
collection PubMed
description Imbalanced data, a common challenge encountered in statistical analyses of clinical trial datasets and disease modeling, refers to the scenario where one class significantly outnumbers the other in a binary classification problem. This imbalance can lead to biased model performance, favoring the majority class, and affecting the understanding of the relative importance of predictive variables. Despite its prevalence, the existing literature lacks comprehensive studies that elucidate methodologies to handle imbalanced data effectively. In this study, we discuss the binary logistic model and its limitations when dealing with imbalanced data, as model performance tends to be biased towards the majority class. We propose a novel approach to addressing imbalanced data and apply it to publicly available data from the VITAL trial, a large-scale clinical trial that examines the effects of vitamin D and Omega-3 fatty acid to investigate the relationship between vitamin D and cancer incidence in sub-populations based on race/ethnicity and demographic factors such as body mass index (BMI), age, and sex. Our results demonstrate a significant improvement in model performance after our undersampling method is applied to the data set with respect to cancer incidence prediction. Both epidemiological and laboratory studies have suggested that vitamin D may lower the occurrence and death rate of cancer, but inconsistent and conflicting findings have been reported due to the difficulty of conducting large-scale clinical trials. We also utilize logistic regression within each ethnic sub-population to determine the impact of demographic factors on cancer incidence, with a particular focus on the role of vitamin D. This study provides a framework for using classification models to understand relative variable importance when dealing with imbalanced data.
format Online
Article
Text
id pubmed-10569817
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-10569817 2023-10-13 Front Oncol Oncology Frontiers Media S.A. 2023-09-28 /pmc/articles/PMC10569817/ /pubmed/37841430 http://dx.doi.org/10.3389/fonc.2023.1227842 Text en
Copyright © 2023 Meysami, Kumar, Pugh, Lowery, Sur, Mondal and Greene
https://creativecommons.org/licenses/by/4.0/
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
topic Oncology
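The abstract describes fitting a binary logistic model after undersampling the majority class, but this record does not specify the authors' undersampling procedure. The sketch below substitutes plain random undersampling to illustrate the general idea. It is an assumption-laden illustration, not the paper's method: the data are synthetic (not the VITAL trial), and the names `undersample`, `fit_logistic`, and `recall` are hypothetical helpers written for this example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """Predicted probability of class 1; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size.

    Assumption: simple random undersampling, not the authors' procedure.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit [intercept, slopes...] by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def recall(w, X, y):
    """Share of true class-1 rows predicted as class 1 at a 0.5 threshold."""
    hits = total = 0
    for xi, yi in zip(X, y):
        if yi == 1:
            total += 1
            hits += 1 if predict(w, xi) >= 0.5 else 0
    return hits / total if total else 0.0


# Synthetic imbalanced data: one covariate, rare positive outcome.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0)] for _ in range(2000)]
y = [1 if rng.random() < sigmoid(-3.0 + x[0]) else 0 for x in X]

Xb, yb = undersample(X, y)

w_full = fit_logistic(X, y)    # trained on the imbalanced data
w_bal = fit_logistic(Xb, yb)   # trained on the balanced subsample

# Both models are scored on the positives of the full data set.
recall_full = recall(w_full, X, y)
recall_bal = recall(w_bal, X, y)
print("positives:", sum(y), "of", len(y))
print("recall, imbalanced fit:", round(recall_full, 3))
print("recall, undersampled fit:", round(recall_bal, 3))
```

In this toy setting the model fitted to the imbalanced data rarely predicts the minority class at the 0.5 threshold, while the model fitted to the balanced subsample recovers far more of the positives. Undersampling mainly shifts the fitted intercept, so predicted probabilities from the balanced fit should not be read as population-level risk without recalibration.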