SMOTE for high-dimensional class-imbalanced data

BACKGROUND: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class...


Bibliographic Details
Main authors:	Blagus, Rok; Lusa, Lara
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/ https://www.ncbi.nlm.nih.gov/pubmed/23522326 http://dx.doi.org/10.1186/1471-2105-14-106

_version_	1782268842864541696
author	Blagus, Rok; Lusa, Lara
author_facet	Blagus, Rok; Lusa, Lara
author_sort	Blagus, Rok
collection	PubMed
description	BACKGROUND: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. RESULTS: While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers for high-dimensional data if the number of variables is reduced performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values while it decreases the data variability and it introduces correlation between samples. We explain how our findings impact the class-prediction for high-dimensional data. CONCLUSIONS: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
format	Online Article Text
id	pubmed-3648438
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3648438 2013-05-10 SMOTE for high-dimensional class-imbalanced data Blagus, Rok Lusa, Lara BMC Bioinformatics Research Article BioMed Central 2013-03-22 /pmc/articles/PMC3648438/ /pubmed/23522326 http://dx.doi.org/10.1186/1471-2105-14-106 Text en Copyright © 2013 Blagus and Lusa; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Blagus, Rok Lusa, Lara SMOTE for high-dimensional class-imbalanced data
title	SMOTE for high-dimensional class-imbalanced data
title_full	SMOTE for high-dimensional class-imbalanced data
title_fullStr	SMOTE for high-dimensional class-imbalanced data
title_full_unstemmed	SMOTE for high-dimensional class-imbalanced data
title_short	SMOTE for high-dimensional class-imbalanced data
title_sort	smote for high-dimensional class-imbalanced data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/ https://www.ncbi.nlm.nih.gov/pubmed/23522326 http://dx.doi.org/10.1186/1471-2105-14-106
work_keys_str_mv	AT blagusrok smoteforhighdimensionalclassimbalanceddata AT lusalara smoteforhighdimensionalclassimbalanceddata
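
Illustrative note (not part of the source record): the abstract reports that, on high-dimensional data, SMOTE leaves the class-specific means unchanged while decreasing variability. The sketch below is one way to see this on simulated data. It assumes the usual SMOTE interpolation rule, in which a synthetic sample is a random point on the segment between a minority sample and one of its k nearest minority-class neighbors under Euclidean distance. The function name smote_sketch, its parameters, and the simulated data are hypothetical and are not taken from the paper. Under the additional assumption that a sample and its neighbor are roughly independent in any single variable, the interpolation weight u ~ U(0,1) gives a per-variable variance of about (E[(1-u)^2] + E[u^2]) = 2/3 of the original.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples (hypothetical helper).

    Each synthetic sample is x + u * (x_nn - x), where x is a randomly
    chosen minority sample, x_nn is one of its k nearest minority
    neighbors (Euclidean distance), and u is uniform on [0, 1).
    """
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    if k >= n:
        raise ValueError("k must be smaller than the number of minority samples")

    # Pairwise squared Euclidean distances among minority samples.
    sq_norms = np.sum(X_min ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X_min @ X_min.T)
    np.fill_diagonal(d2, np.inf)  # a sample is never its own neighbor

    # Indices of the k nearest neighbors of every minority sample.
    nn_idx = np.argsort(d2, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                 # sample to start from
    neigh = nn_idx[base, rng.integers(0, k, size=n_new)]  # one of its neighbors
    u = rng.random((n_new, 1))                            # interpolation weight
    return X_min[base] + u * (X_min[neigh] - X_min[base])


if __name__ == "__main__":
    # Simulated minority class: 40 samples, 1000 i.i.d. N(0, 1) variables.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 1000))
    S = smote_sketch(X, n_new=2000, k=5, seed=2)

    print("mean of original variables :", X.mean())
    print("mean of synthetic variables:", S.mean())
    print("average per-variable variance, original :", X.var(axis=0).mean())
    print("average per-variable variance, synthetic:", S.var(axis=0).mean())
```

Run on simulated data such as the above, the overall means should stay close to each other while the average per-variable variance of the synthetic samples falls noticeably below that of the original samples. Every synthetic sample is built from two original samples, which is also the source of the correlation between samples mentioned in the abstract.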
