On the Performance of Oversampling Techniques for Class Imbalance Problems

Although over 90 oversampling approaches have been developed in the imbalance learning domain, most of the empirical study and application work are still based on the “classical” resampling techniques. In this paper, several experiments on 19 benchmark datasets are set up to study the efficiency of...

Bibliographic Details
Main authors: Kong, Jiawen; Rios, Thiago; Kowalczyk, Wojtek; Menzel, Stefan; Bäck, Thomas
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206329/
http://dx.doi.org/10.1007/978-3-030-47436-2_7
_version_ 1783530394107248640
author Kong, Jiawen
Rios, Thiago
Kowalczyk, Wojtek
Menzel, Stefan
Bäck, Thomas
collection PubMed
description Although over 90 oversampling approaches have been developed in the imbalance learning domain, most of the empirical study and application work are still based on the “classical” resampling techniques. In this paper, several experiments on 19 benchmark datasets are set up to study the efficiency of six powerful oversampling approaches, including both “classical” and new ones. According to our experimental results, oversampling techniques that consider the minority class distribution (new ones) perform better in most cases and RACOG gives the best performance among the six reviewed approaches. We further validate our conclusion on our real-world inspired vehicle datasets and also find applying oversampling techniques can improve the performance by around 10%. In addition, seven data complexity measures are considered for the initial purpose of investigating the relationship between data complexity measures and the choice of resampling techniques. Although no obvious relationship can be abstracted in our experiments, we find F1v value, a measure for evaluating the overlap which most researchers ignore, has a strong negative correlation with the potential AUC value (after resampling).
format Online
Article
Text
id pubmed-7206329
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7206329 2020-05-08 Advances in Knowledge Discovery and Data Mining Article 2020-04-17 /pmc/articles/PMC7206329/ http://dx.doi.org/10.1007/978-3-030-47436-2_7 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title On the Performance of Oversampling Techniques for Class Imbalance Problems
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206329/
http://dx.doi.org/10.1007/978-3-030-47436-2_7