Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment
Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. We pr...
Main authors: | Vivek, Yelleti; Ravi, Vadlamani; Krishna, P. Radha |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer US 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9463682/ https://www.ncbi.nlm.nih.gov/pubmed/36105649 http://dx.doi.org/10.1007/s10586-022-03725-w |
_version_ | 1784787441443405824 |
---|---|
author | Vivek, Yelleti Ravi, Vadlamani Krishna, P. Radha |
author_facet | Vivek, Yelleti Ravi, Vadlamani Krishna, P. Radha |
author_sort | Vivek, Yelleti |
collection | PubMed |
description | Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. We propose two hybrid EAs based on Binary Differential Evolution (BDE) and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms, adaptive DE (ADE) and permutation-based DE (DE-FS(PM)), and named them PB-ADE and P-DE-FS(PM), respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, the area under the receiver operating characteristic curve (AUC). The effectiveness of the proposed algorithms is tested on five big datasets of varying dimensions. Notably, PB-TADE turned out to be statistically significantly better than the rest. All the algorithms exhibited repeatability. The proposed parallel model attained a speedup of 2.2–2.9. We also report the feature subsets with high AUC and the least cardinality. |
format | Online Article Text |
id | pubmed-9463682 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer US |
record_format | MEDLINE/PubMed |
spelling | pubmed-9463682 2022-09-10 Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment Vivek, Yelleti Ravi, Vadlamani Krishna, P. Radha Cluster Comput Article Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithms (EAs) based wrappers under Apache Spark. We propose two hybrid EAs based on Binary Differential Evolution (BDE) and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms, adaptive DE (ADE) and permutation-based DE (DE-FS(PM)), and named them PB-ADE and P-DE-FS(PM), respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, the area under the receiver operating characteristic curve (AUC). The effectiveness of the proposed algorithms is tested on five big datasets of varying dimensions. Notably, PB-TADE turned out to be statistically significantly better than the rest. All the algorithms exhibited repeatability. The proposed parallel model attained a speedup of 2.2–2.9. We also report the feature subsets with high AUC and the least cardinality. 
Springer US 2022-09-10 2023 /pmc/articles/PMC9463682/ /pubmed/36105649 http://dx.doi.org/10.1007/s10586-022-03725-w Text en © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Vivek, Yelleti Ravi, Vadlamani Krishna, P. Radha Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment |
title | Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment |
title_full | Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment |
title_fullStr | Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment |
title_full_unstemmed | Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment |
title_short | Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment |
title_sort | scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9463682/ https://www.ncbi.nlm.nih.gov/pubmed/36105649 http://dx.doi.org/10.1007/s10586-022-03725-w |
work_keys_str_mv | AT vivekyelleti scalablefeaturesubsetselectionforbigdatausingparallelhybridevolutionaryalgorithmbasedwrapperunderapachesparkenvironment AT ravivadlamani scalablefeaturesubsetselectionforbigdatausingparallelhybridevolutionaryalgorithmbasedwrapperunderapachesparkenvironment AT krishnapradha scalablefeaturesubsetselectionforbigdatausingparallelhybridevolutionaryalgorithmbasedwrapperunderapachesparkenvironment |
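
The description field above outlines a wrapper in which a binary differential evolution (BDE) search is paired with binary threshold accepting (BTA), with AUC as the fitness. The following is a minimal single-machine sketch of that general idea, not the authors' implementation. Several parts are assumptions made only for illustration: the XOR-based mutation is one common binary DE variant, BTA is applied to the current best mask after each BDE generation, a simple "sum of selected features" scorer stands in for logistic regression, the data are synthetic, and the Spark distribution is omitted entirely. All function names and parameter values are hypothetical.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fitness(mask, X, y):
    """Assumed stand-in for the LR wrapper: score each row by the sum of selected features."""
    if not any(mask):
        return 0.0  # an empty subset gets the worst fitness
    scores = [sum(v for v, m in zip(row, mask) if m) for row in X]
    return auc(scores, y)

def bde_generation(pop, fits, X, y, rng, f=0.5, cr=0.9):
    """One binary DE generation, updating pop and fits in place."""
    n = len(pop[0])
    for i in range(len(pop)):
        a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
        j_rand = rng.randrange(n)  # guarantees at least one crossed-over gene
        trial = list(pop[i])
        for k in range(n):
            if rng.random() < cr or k == j_rand:
                # Mutation: flip a[k] where b and c disagree, with probability f.
                trial[k] = a[k] ^ (b[k] ^ c[k]) if rng.random() < f else a[k]
        trial_fit = fitness(trial, X, y)
        if trial_fit >= fits[i]:  # greedy selection keeps the better mask
            pop[i], fits[i] = trial, trial_fit

def bta_refine(mask, fit, X, y, rng, threshold=0.05, decay=0.8, steps=10):
    """Threshold accepting: take a one-bit move unless it is worse by more than threshold."""
    cur, cur_fit = list(mask), fit
    best, best_fit = list(mask), fit
    for _ in range(steps):
        cand = list(cur)
        cand[rng.randrange(len(cand))] ^= 1  # flip one randomly chosen bit
        cand_fit = fitness(cand, X, y)
        if cand_fit >= cur_fit - threshold:
            cur, cur_fit = cand, cand_fit
            if cur_fit > best_fit:
                best, best_fit = list(cur), cur_fit
        threshold *= decay  # tolerate less deterioration over time
    return best, best_fit  # best mask seen, never worse than the input

def hybrid_fss(X, y, pop_size=8, generations=15, seed=0):
    """BDE followed by BTA on the current best mask, once per generation."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    fits = [fitness(p, X, y) for p in pop]
    history = []
    for _ in range(generations):
        bde_generation(pop, fits, X, y, rng)
        i = max(range(pop_size), key=fits.__getitem__)
        pop[i], fits[i] = bta_refine(pop[i], fits[i], X, y, rng)
        history.append(max(fits))
    i = max(range(pop_size), key=fits.__getitem__)
    return pop[i], fits[i], history

def make_toy_data(n_rows=60, n_feat=8, informative=(0, 3), seed=1):
    """Synthetic data: informative columns are shifted by the label, the rest are noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_rows)]
    X = [[(label if k in informative else 0) + rng.gauss(0, 1)
          for k in range(n_feat)] for label in y]
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    mask, fit, history = hybrid_fss(X, y)
    print("selected mask:", mask, "AUC:", round(fit, 3))
```

Because `bta_refine` returns the best mask it has seen rather than its final position, the best fitness in the population never decreases between generations, and a fixed seed reproduces the same result. In a distributed setting the per-mask fitness evaluations are the natural unit to parallelize, but how the paper partitions that work is not stated in this record.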