
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests

BACKGROUND: The random forests algorithm is a type of classifier with prominent universality, a wide application range, and robustness for avoiding overfitting. But there are still some drawbacks to random forests. Therefore, to improve the performance of random forests, this paper seeks to improve...


Bibliographic Details
Main Authors: Ma, Li; Fan, Suohai
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5351181/
https://www.ncbi.nlm.nih.gov/pubmed/28292263
http://dx.doi.org/10.1186/s12859-017-1578-z
author Ma, Li
Fan, Suohai
collection PubMed
description BACKGROUND: The random forests algorithm is a classifier with broad applicability and robustness against overfitting, but it still has drawbacks. To improve the performance of random forests, this paper seeks to improve imbalanced data processing, feature selection and parameter optimization. RESULTS: We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that combining Clustering Using Representatives (CURE) with the original synthetic minority oversampling technique (SMOTE) enhances it effectively, compared with the classification results on the original data using random sampling, Borderline-SMOTE1, safe-level SMOTE, C-SMOTE, and k-means-SMOTE. Additionally, a hybrid RF (random forests) algorithm is proposed for feature selection and parameter optimization; it uses the minimum out-of-bag (OOB) data error as its objective function. Simulation results on binary and higher-dimensional data indicate that the proposed hybrid RF algorithms (the hybrid genetic-random forests algorithm, the hybrid particle swarm-random forests algorithm and the hybrid fish swarm-random forests algorithm) can achieve the minimum OOB error and show the best generalization ability. CONCLUSION: The training set produced by the proposed CURE-SMOTE algorithm is closer to the original data distribution because it contains minimal noise. Thus, this feasible and effective algorithm produces better classification results. Moreover, the F-value, G-mean, AUC and OOB scores of the hybrid algorithms demonstrate that they surpass the performance of the original RF algorithm. Hence, the hybrid algorithm provides a new way to perform feature selection and parameter optimization.
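As a rough illustration of the oversampling idea named in the description, the following is a minimal sketch of plain SMOTE-style interpolation only. It is not the authors' CURE-SMOTE: the CURE clustering and noise-removal step is omitted, and the function name `smote_oversample`, its parameters, and the toy data are hypothetical choices made for this example.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Plain SMOTE-style interpolation only (an assumption for illustration);
    the CURE clustering step of CURE-SMOTE is not modelled here.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class with two features per sample
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_new=5)
```

Each synthetic point lies on a line segment between two existing minority samples, so it stays inside the region the minority class already occupies. A cluster-based variant would, presumably, restrict which samples are eligible as endpoints before interpolating.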
format Online
Article
Text
id pubmed-5351181
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5351181 2017-03-17 CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. Ma, Li; Fan, Suohai. BMC Bioinformatics. Research Article. BioMed Central 2017-03-14 /pmc/articles/PMC5351181/ /pubmed/28292263 http://dx.doi.org/10.1186/s12859-017-1578-z Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests
topic Research Article