Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data


Bibliographic Details
Main authors: Nadeem, Khurram; Jabri, Mehdi-Abderrahman
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects: Research Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9844919/
https://www.ncbi.nlm.nih.gov/pubmed/36649281
http://dx.doi.org/10.1371/journal.pone.0280258
author Nadeem, Khurram
Jabri, Mehdi-Abderrahman
collection PubMed
description We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high-dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling, which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard-shrinkage (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance in the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with a very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.
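The general idea in the description (balance the classes by subsampling on the response, fit a regularized logistic model to each balanced subsample, and rank covariates across the ensemble) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the authors' published algorithm: the helper names (`soft_threshold`, `fit_lasso_logistic`, `balanced_subsample`, `selection_frequencies`), the penalty value `lam`, the use of plain proximal gradient descent for the Lasso fit, the rule of keeping every minority-class row and drawing an equal number of majority-class rows, and the use of selection frequency as the ranking score. The simulated data are likewise hypothetical.

```python
import math
import random


def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink z toward zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=150):
    """L1-penalized logistic regression via proximal gradient descent (ISTA).

    The intercept is left unpenalized. Returns the slope coefficients only.
    """
    n, p = len(X), len(X[0])
    beta, b0 = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad, g0 = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            eta = b0 + sum(b * v for b, v in zip(beta, xi))
            eta = max(-30.0, min(30.0, eta))  # guard against overflow in exp
            resid = 1.0 / (1.0 + math.exp(-eta)) - yi
            g0 += resid
            for j in range(p):
                grad[j] += resid * xi[j]
        b0 -= step * g0 / n
        beta = [soft_threshold(b - step * g / n, step * lam)
                for b, g in zip(beta, grad)]
    return beta


def balanced_subsample(y, rng):
    """Assumed scheme: keep all minority rows, draw as many majority rows."""
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def selection_frequencies(X, y, lam=0.05, n_models=10, seed=0):
    """Share of ensemble members in which each covariate has a nonzero slope."""
    rng = random.Random(seed)
    p = len(X[0])
    counts = [0] * p
    for _ in range(n_models):
        idx = balanced_subsample(y, rng)
        beta = fit_lasso_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            counts[j] += int(b != 0.0)
    return [c / n_models for c in counts]


# Hypothetical imbalanced data: only covariate 0 carries signal.
rng = random.Random(42)
n, p = 1000, 5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [int(rng.random() < 1.0 / (1.0 + math.exp(4.0 - 2.0 * row[0]))) for row in X]

freqs = selection_frequencies(X, y)
ranking = sorted(range(p), key=lambda j: -freqs[j])  # most stable first
print("selection frequencies:", freqs)
print("ranking:", ranking)
```

Selection frequency is only meaningful for penalties that set coefficients exactly to zero, such as the Lasso. For ridge, which the description also covers, a different ranking score would be needed, and the sketch does not attempt one.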
format Online
Article
Text
id pubmed-9844919
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9844919 2023-01-18 PLoS One Research Article
Public Library of Science 2023-01-17 /pmc/articles/PMC9844919/ /pubmed/36649281 http://dx.doi.org/10.1371/journal.pone.0280258 Text en © 2023 Nadeem, Jabri. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9844919/
https://www.ncbi.nlm.nih.gov/pubmed/36649281
http://dx.doi.org/10.1371/journal.pone.0280258