A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning

Bibliographic Details
Main authors:	Szeghalmy, Szilvia; Fazekas, Attila
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9967638/ https://www.ncbi.nlm.nih.gov/pubmed/36850931 http://dx.doi.org/10.3390/s23042333

author	Szeghalmy, Szilvia; Fazekas, Attila
author_facet	Szeghalmy, Szilvia; Fazekas, Attila
author_sort	Szeghalmy, Szilvia
collection	PubMed
description	Many practical problems are now solved with machine learning tools. However, compiling an appropriate training data set for real-world classification problems is challenging because collecting the right amount of data for each class is often difficult or even impossible. In such cases, we easily face the problem of imbalanced learning. The literature offers many methods for imbalanced learning, so how to compare their performance has become a serious question. Inadequate validation techniques can give misleading results (e.g., due to data shift), which has led to the development of validation methods designed for imbalanced data sets, such as stratified cross-validation (SCV) and distribution optimally balanced SCV (DOB-SCV). Previous studies have shown that higher classification performance scores (AUC) can be achieved on imbalanced data sets using DOB-SCV instead of SCV. We investigated the effect of oversamplers on this difference. The study was conducted on 420 data sets, involving several sampling methods and the DTree, kNN, SVM, and MLP classifiers. We point out that DOB-SCV often provides slightly higher F1 and AUC values for classification combined with sampling. However, the results also show that the selection of the sampler–classifier pair is more important for classification performance than the choice between the DOB-SCV and SCV techniques.
format	Online Article Text
id	pubmed-9967638
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-9967638 2023-02-27 Sensors (Basel) Article MDPI 2023-02-20 /pmc/articles/PMC9967638/ /pubmed/36850931 http://dx.doi.org/10.3390/s23042333 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9967638/ https://www.ncbi.nlm.nih.gov/pubmed/36850931 http://dx.doi.org/10.3390/s23042333