
Optimal selection of resampling methods for imbalanced data with high complexity

Class imbalance is a major problem in classification because the decision boundary is easily biased toward the majority class. Resampling is a data-level solution to this problem. However, several studies have shown that resampling methods can degrade the classification...


Bibliographic Details
Main Authors:	Kim, Annie; Jung, Inkyung
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10374143/ https://www.ncbi.nlm.nih.gov/pubmed/37498823 http://dx.doi.org/10.1371/journal.pone.0288540

collection	PubMed
description	Class imbalance is a major problem in classification because the decision boundary is easily biased toward the majority class. Resampling is a data-level solution to this problem. However, several studies have shown that resampling methods can degrade classification performance. The cause is the overgeneralization problem, which occurs when samples produced by oversampling, which should lie in the minority-class domain, are placed in the majority-class domain. This study shows that the overgeneralization problem is aggravated in complex data settings and introduces two alternative approaches to mitigate it. The first approach incorporates a filtering method into oversampling. The second approach applies undersampling. The main objective of this study is to provide guidance on selecting optimal resampling methods for imbalanced and complex datasets to improve classification performance. Simulation studies and real data analyses were performed to compare the resampling results across scenarios with different complexities, imbalance levels, and sample sizes. For noncomplex datasets, undersampling was found to be optimal. For complex datasets, however, applying a filtering method to delete misallocated examples was optimal. In conclusion, this study can aid researchers in selecting the optimal method for resampling complex datasets.
id	pubmed-10374143
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-10374143 2023-07-28. Kim, Annie; Jung, Inkyung. Optimal selection of resampling methods for imbalanced data with high complexity. PLoS One. Research Article. Public Library of Science, 2023-07-27. /pmc/articles/PMC10374143/ /pubmed/37498823 http://dx.doi.org/10.1371/journal.pone.0288540 Text en. © 2023 Kim, Jung. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
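
The abstract names two ways to mitigate overgeneralization: filtering after oversampling, and undersampling. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes two-class data stored as lists of numeric tuples; the function names, the random-pair interpolation (a simplification of SMOTE-style oversampling), and the neighbour-vote filter are all illustrative assumptions, since the record does not say which specific methods the study used.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Randomly drop majority points until both classes have the same size."""
    kept = rng.sample(majority, k=min(len(majority), len(minority)))
    return kept, list(minority)


def interpolate_oversample(minority, n_new, rng):
    """Simplified SMOTE-style oversampling (assumption): each new point lies
    on the segment between two randomly chosen minority points."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, k=2)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def filter_synthetic(synthetic, majority, minority, k=3):
    """Hypothetical filter: keep a synthetic point only if most of its k
    nearest original neighbours belong to the minority class."""
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    kept = []
    for s in synthetic:
        nearest = sorted(labelled, key=lambda pl: math.dist(s, pl[0]))[:k]
        minority_votes = sum(label for _, label in nearest)
        if minority_votes * 2 > k:
            kept.append(s)
    return kept


# Toy data (assumption): two overlapping Gaussian clouds, 200 vs. 20 points.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(20)]

maj_u, min_u = random_undersample(majority, minority, rng)
synthetic = interpolate_oversample(minority, len(majority) - len(minority), rng)
kept = filter_synthetic(synthetic, majority, minority)
print(len(maj_u), len(min_u), len(synthetic), len(kept))
```

Synthetic points that fall among majority neighbours are the "misallocated examples" in the abstract's sense; the filter discards them, so the oversampled minority class may end up smaller than the majority class. In practice a library such as imbalanced-learn would likely replace these hand-written helpers.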
