SMOTE for high-dimensional class-imbalanced data
BACKGROUND: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class...
Main authors: | Blagus, Rok; Lusa, Lara |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/ https://www.ncbi.nlm.nih.gov/pubmed/23522326 http://dx.doi.org/10.1186/1471-2105-14-106 |
_version_ | 1782268842864541696 |
---|---|
author | Blagus, Rok Lusa, Lara |
author_facet | Blagus, Rok Lusa, Lara |
author_sort | Blagus, Rok |
collection | PubMed |
description | BACKGROUND: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. RESULTS: While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers for high-dimensional data if the number of variables is reduced performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values while it decreases the data variability and it introduces correlation between samples. We explain how our findings impact the class-prediction for high-dimensional data. CONCLUSIONS: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class. |
format | Online Article Text |
id | pubmed-3648438 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3648438 2013-05-10 SMOTE for high-dimensional class-imbalanced data Blagus, Rok Lusa, Lara BMC Bioinformatics Research Article BioMed Central 2013-03-22 /pmc/articles/PMC3648438/ /pubmed/23522326 http://dx.doi.org/10.1186/1471-2105-14-106 Text en Copyright © 2013 Blagus and Lusa; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Blagus, Rok Lusa, Lara SMOTE for high-dimensional class-imbalanced data |
title | SMOTE for high-dimensional class-imbalanced data |
title_full | SMOTE for high-dimensional class-imbalanced data |
title_fullStr | SMOTE for high-dimensional class-imbalanced data |
title_full_unstemmed | SMOTE for high-dimensional class-imbalanced data |
title_short | SMOTE for high-dimensional class-imbalanced data |
title_sort | smote for high-dimensional class-imbalanced data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/ https://www.ncbi.nlm.nih.gov/pubmed/23522326 http://dx.doi.org/10.1186/1471-2105-14-106 |
work_keys_str_mv | AT blagusrok smoteforhighdimensionalclassimbalanceddata AT lusalara smoteforhighdimensionalclassimbalanceddata |
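
The description field above states that SMOTE leaves class-specific means unchanged while reducing data variability. The following is a minimal sketch of the general SMOTE interpolation step, written to illustrate that claim on simulated data; it is not the authors' code. The function names `smote` and `mean_variance`, the parameter names, and the Gaussian toy data are all assumptions made for this example. As a rough guide, under the simplifying assumption that the two interpolated samples are independent with equal variance and the interpolation weight is uniform on [0, 1], the synthetic variance works out to about 2/3 of the original.

```python
import math
import random
import statistics


def smote(minority, n_synthetic, k=5, rng=None):
    """Create synthetic minority samples by interpolating between a sample
    and one of its k nearest minority-class neighbors (Euclidean distance).

    Illustrative sketch only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    if rng is None:
        rng = random.Random()
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Candidate neighbors: every other minority sample, nearest first.
        others = [z for j, z in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda z: math.dist(x, z))[:k]
        z = rng.choice(neighbors)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([xj + u * (zj - xj) for xj, zj in zip(x, z)])
    return synthetic


def mean_variance(samples):
    """Average, over variables, of the per-variable sample variance."""
    return statistics.fmean(statistics.variance(col) for col in zip(*samples))


# Toy high-dimensional setting (assumed): 20 minority samples, 50 variables.
demo_rng = random.Random(1)
demo_minority = [[demo_rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(20)]
demo_synthetic = smote(demo_minority, n_synthetic=300, k=5, rng=demo_rng)

print("average variance, original :", mean_variance(demo_minority))
print("average variance, synthetic:", mean_variance(demo_synthetic))
```

Because each synthetic sample is a convex combination of two observed minority samples, it always lies within the observed range of every variable, which is one way to see why the variability shrinks and why synthetic samples are correlated with the samples they were built from.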