
Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment

Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. We pr...


Bibliographic Details
Main authors:	Vivek, Yelleti; Ravi, Vadlamani; Krishna, P. Radha
Format:	Online Article Text
Language:	English
Published:	Springer US 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9463682/ https://www.ncbi.nlm.nih.gov/pubmed/36105649 http://dx.doi.org/10.1007/s10586-022-03725-w

author	Vivek, Yelleti; Ravi, Vadlamani; Krishna, P. Radha
collection	PubMed
description	Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithm (EA) based wrappers under Apache Spark. We propose two hybrid EAs based on Binary Differential Evolution (BDE) and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms, adaptive DE (ADE) and permutation-based DE (DE-FS(PM)), and named them PB-ADE and P-DE-FS(PM), respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, the area under the receiver operating characteristic curve (AUC). The effectiveness of the proposed algorithms is tested on five big datasets of varying dimensions. It is noteworthy that PB-TADE turned out to be statistically significantly better than the rest. All the algorithms exhibited the repeatability property. The proposed parallel model attained a speedup of 2.2–2.9. We also report the feature subsets with high AUC and the least cardinality.
format	Online Article Text
id	pubmed-9463682
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer US
record_format	MEDLINE/PubMed
spelling	pubmed-9463682 2022-09-10 Vivek, Yelleti; Ravi, Vadlamani; Krishna, P. Radha Cluster Comput Article
Springer US 2022-09-10 2023 /pmc/articles/PMC9463682/ /pubmed/36105649 http://dx.doi.org/10.1007/s10586-022-03725-w Text en © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment
topic	Article


Similar Items