Cargando…

Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm

The training machine learning algorithm from an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but with a massive number of features (high dimensionality). The high dimensional and imbalanced data set has posed severe challenges in many real-wor...

Descripción completa

Detalles Bibliográficos
Autores principales: Abdulrauf Sharifai, Garba, Zainol, Zurinahni
Formato: Online Artículo Texto
Lenguaje:English
Publicado: MDPI 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7397300/
https://www.ncbi.nlm.nih.gov/pubmed/32605144
http://dx.doi.org/10.3390/genes11070717
_version_ 1783565749315436544
author Abdulrauf Sharifai, Garba
Zainol, Zurinahni
author_facet Abdulrauf Sharifai, Garba
Zainol, Zurinahni
author_sort Abdulrauf Sharifai, Garba
collection PubMed
description The training machine learning algorithm from an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but with a massive number of features (high dimensionality). The high dimensional and imbalanced data set has posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers investigated either imbalanced class or high dimensional data sets and came up with various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high dimensional and imbalanced class problem due to their complicated interactions. Lately, feature selection has become a well-known technique that has been used to overcome this problem by selecting discriminative features that represent minority and majority class. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA has employed an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to construct the feature selection process as an optimisation problem to select the best (near-optimal) combination of features from the majority and minority class. The obtained results, supported by the proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced datasets in terms of G-mean and the Area Under the Curve (AUC) performance metrics.
format Online
Article
Text
id pubmed-7397300
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-73973002020-08-16 Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm Abdulrauf Sharifai, Garba Zainol, Zurinahni Genes (Basel) Article The training machine learning algorithm from an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but with a massive number of features (high dimensionality). The high dimensional and imbalanced data set has posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers investigated either imbalanced class or high dimensional data sets and came up with various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high dimensional and imbalanced class problem due to their complicated interactions. Lately, feature selection has become a well-known technique that has been used to overcome this problem by selecting discriminative features that represent minority and majority class. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA has employed an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to construct the feature selection process as an optimisation problem to select the best (near-optimal) combination of features from the majority and minority class. The obtained results, supported by the proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced datasets in terms of G-mean and the Area Under the Curve (AUC) performance metrics. MDPI 2020-06-27 /pmc/articles/PMC7397300/ /pubmed/32605144 http://dx.doi.org/10.3390/genes11070717 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Abdulrauf Sharifai, Garba
Zainol, Zurinahni
Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm
title Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm
title_full Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm
title_fullStr Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm
title_full_unstemmed Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm
title_short Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm
title_sort feature selection for high-dimensional and imbalanced biomedical data based on robust correlation based redundancy and binary grasshopper optimization algorithm
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7397300/
https://www.ncbi.nlm.nih.gov/pubmed/32605144
http://dx.doi.org/10.3390/genes11070717
work_keys_str_mv AT abdulraufsharifaigarba featureselectionforhighdimensionalandimbalancedbiomedicaldatabasedonrobustcorrelationbasedredundancyandbinarygrasshopperoptimizationalgorithm
AT zainolzurinahni featureselectionforhighdimensionalandimbalancedbiomedicaldatabasedonrobustcorrelationbasedredundancyandbinarygrasshopperoptimizationalgorithm