An empirical evaluation of sampling methods for the classification of imbalanced data

In numerous classification problems, class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is...

Bibliographic Details
Main authors:	Kim, Misuk, Hwang, Kyu-Baek
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2022
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9333262/ https://www.ncbi.nlm.nih.gov/pubmed/35901023 http://dx.doi.org/10.1371/journal.pone.0271260

_version_	1784758834725650432
author	Kim, Misuk Hwang, Kyu-Baek
author_facet	Kim, Misuk Hwang, Kyu-Baek
author_sort	Kim, Misuk
collection	PubMed
description	In numerous classification problems, class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance training data by oversampling the underrepresented (or undersampling the overrepresented) classes before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and receiver operating characteristics curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests P < 0.05) only for few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce rather than improve the classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests P < 0.05). In conclusion, the applicability of sampling is limited because it could be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
format	Online Article Text
id	pubmed-9333262
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-93332622022-07-29 An empirical evaluation of sampling methods for the classification of imbalanced data Kim, Misuk Hwang, Kyu-Baek PLoS One Research Article In numerous classification problems, class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance training data by oversampling the underrepresented (or undersampling the overrepresented) classes before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and receiver operating characteristics curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests P < 0.05) only for few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce rather than improve the classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests P < 0.05). 
In conclusion, the applicability of sampling is limited because it could be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification. Public Library of Science 2022-07-28 /pmc/articles/PMC9333262/ /pubmed/35901023 http://dx.doi.org/10.1371/journal.pone.0271260 Text en © 2022 Misuk, Kyu-Baek https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Kim, Misuk Hwang, Kyu-Baek An empirical evaluation of sampling methods for the classification of imbalanced data
title	An empirical evaluation of sampling methods for the classification of imbalanced data
title_full	An empirical evaluation of sampling methods for the classification of imbalanced data
title_fullStr	An empirical evaluation of sampling methods for the classification of imbalanced data
title_full_unstemmed	An empirical evaluation of sampling methods for the classification of imbalanced data
title_short	An empirical evaluation of sampling methods for the classification of imbalanced data
title_sort	empirical evaluation of sampling methods for the classification of imbalanced data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9333262/ https://www.ncbi.nlm.nih.gov/pubmed/35901023 http://dx.doi.org/10.1371/journal.pone.0271260
work_keys_str_mv	AT kimmisuk anempiricalevaluationofsamplingmethodsfortheclassificationofimbalanceddata AT hwangkyubaek anempiricalevaluationofsamplingmethodsfortheclassificationofimbalanceddata AT kimmisuk empiricalevaluationofsamplingmethodsfortheclassificationofimbalanceddata AT hwangkyubaek empiricalevaluationofsamplingmethodsfortheclassificationofimbalanceddata
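
Illustrative note (not part of the catalogued record): the abstract describes balancing training data by oversampling and scoring classifiers with AUROC and AUPRC. The sketch below is a minimal, standard-library illustration of those three ideas, not the authors' code. The function names, the toy labels, and the toy scores are all hypothetical. Random oversampling stands in for the seven sampling methods the study evaluated, and average precision is used as a simple step-wise estimate of AUPRC that assumes there are no tied scores.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = pos + neg + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at each positive, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / sum(y_true)


# Hypothetical toy data: 8 negatives, 2 positives (20% positive rate).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.8, 0.95]

print(auroc(y_true, scores))              # 0.9375
print(average_precision(y_true, scores))  # 0.8333...

# Oversampling is applied to training data only; here it just shows the class counts.
X_bal, y_bal = random_oversample(list(range(10)), y_true)
print(len(y_bal), sum(y_bal))             # 16 8
```

On this toy ranking a single high-scoring negative lowers average precision (about 0.83) more than AUROC (about 0.94). This matches the abstract's remark that the two measures can respond differently on imbalanced data.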
