
A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification

In machine learning and data science, feature selection is considered a crucial step of data preprocessing. When raw data are applied directly for classification or clustering, learning algorithms sometimes perform poorly. One possible reason for this is the pre...


Bibliographic Details
Main Authors:	Mandal, Moumita; Singh, Pawan Kumar; Ijaz, Muhammad Fazal; Shafi, Jana; Sarkar, Ram
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8402295/ https://www.ncbi.nlm.nih.gov/pubmed/34451013 http://dx.doi.org/10.3390/s21165571

_version_	1783745755599601664
author	Mandal, Moumita; Singh, Pawan Kumar; Ijaz, Muhammad Fazal; Shafi, Jana; Sarkar, Ram
author_sort	Mandal, Moumita
collection	PubMed
description	In machine learning and data science, feature selection is considered a crucial step of data preprocessing. When raw data are applied directly for classification or clustering, learning algorithms sometimes perform poorly. One possible reason is the presence of redundant, noisy, and non-informative features or attributes in the datasets. Hence, feature selection methods are used to identify the subset of relevant features that maximizes model performance. Moreover, reducing the feature dimension also reduces the training time and storage required by the model. In this paper, we present a tri-stage wrapper-filter-based feature selection framework for medical report-based disease detection. In the first stage, an ensemble was formed from four filter methods (Mutual Information, ReliefF, Chi Square, and Xvariance), and each feature in the union set was then assessed by three classification algorithms (support vector machine, naïve Bayes, and k-nearest neighbors), from which an average accuracy was calculated. The features with higher accuracy were selected to obtain a preliminary subset of optimal features. In the second stage, Pearson correlation was used to discard highly correlated features. In these two stages, the XGBoost classification algorithm was applied to obtain the most contributing features, which in turn provide the best subset. Then, in the final stage, we fed the obtained feature subset to a meta-heuristic algorithm, the whale optimization algorithm, to further reduce the feature set and achieve higher accuracy. We evaluated the proposed framework on four publicly available disease datasets taken from the UCI machine learning repository, namely arrhythmia, leukemia, DLBCL, and prostate cancer. Our results confirm that the proposed method can perform better than many state-of-the-art methods and can also detect important features. Fewer features mean fewer medical tests are needed for a correct diagnosis, saving both time and cost.
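The filter-union and correlation-pruning steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the top-k cut-off, the 0.9 correlation threshold, the rule that the earlier feature of a correlated pair survives, and the toy data are all assumptions made for illustration. The wrapper scoring with the three classifiers, the XGBoost step, and the whale optimization stage are omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)


def union_of_top_k(rankings, k):
    """Union of the k best features from each filter ranking (best first)."""
    selected = set()
    for ranking in rankings:
        selected.update(ranking[:k])
    return sorted(selected)  # sorted only to make the result deterministic


def drop_correlated(columns, candidates, threshold=0.9):
    """Greedily keep features whose |r| with every kept feature is <= threshold.

    `columns` maps feature name -> list of values. Candidates are visited in
    the given order, so the earlier feature of a highly correlated pair
    survives (an assumed tie-breaking rule, not taken from the paper).
    """
    kept = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[other])) <= threshold
               for other in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): f2 is an exact multiple of f1, f3 is unrelated.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
# Made-up outputs of two filter methods, best feature first.
rankings = [["f1", "f2", "f3"], ["f3", "f1", "f2"]]

pool = union_of_top_k(rankings, k=2)
subset = drop_correlated(columns, pool, threshold=0.9)
print(pool, subset)
```

In this toy case the union pools all three features, and the pruning step then drops `f2` because it is perfectly correlated with `f1`. In practice the candidate order would more plausibly come from the stage-one accuracy scores than from alphabetical sorting.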
format	Online Article Text
id	pubmed-8402295
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8402295 2021-08-29 A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification Mandal, Moumita; Singh, Pawan Kumar; Ijaz, Muhammad Fazal; Shafi, Jana; Sarkar, Ram Sensors (Basel) Article MDPI 2021-08-18 /pmc/articles/PMC8402295/ /pubmed/34451013 http://dx.doi.org/10.3390/s21165571 Text en © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	A Tri-Stage Wrapper-Filter Feature Selection Framework for Disease Classification
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8402295/ https://www.ncbi.nlm.nih.gov/pubmed/34451013 http://dx.doi.org/10.3390/s21165571
