Class prediction for high-dimensional class-imbalanced data

BACKGROUND: The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly...

Bibliographic Details
Main Authors:	Blagus, Rok; Lusa, Lara
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098087/ https://www.ncbi.nlm.nih.gov/pubmed/20961420 http://dx.doi.org/10.1186/1471-2105-11-523

_version_	1782203914468196352
author	Blagus, Rok Lusa, Lara
author_facet	Blagus, Rok Lusa, Lara
author_sort	Blagus, Rok
collection	PubMed
description	BACKGROUND: The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. RESULTS: Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. 
CONCLUSIONS: Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
format	Text
id	pubmed-3098087
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3098087 2011-07-08 Class prediction for high-dimensional class-imbalanced data Blagus, Rok Lusa, Lara BMC Bioinformatics Research Article BioMed Central 2010-10-20 /pmc/articles/PMC3098087/ /pubmed/20961420 http://dx.doi.org/10.1186/1471-2105-11-523 Text en Copyright ©2010 Blagus and Lusa; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Blagus, Rok Lusa, Lara Class prediction for high-dimensional class-imbalanced data
title	Class prediction for high-dimensional class-imbalanced data
title_full	Class prediction for high-dimensional class-imbalanced data
title_fullStr	Class prediction for high-dimensional class-imbalanced data
title_full_unstemmed	Class prediction for high-dimensional class-imbalanced data
title_short	Class prediction for high-dimensional class-imbalanced data
title_sort	class prediction for high-dimensional class-imbalanced data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098087/ https://www.ncbi.nlm.nih.gov/pubmed/20961420 http://dx.doi.org/10.1186/1471-2105-11-523
work_keys_str_mv	AT blagusrok classpredictionforhighdimensionalclassimbalanceddata AT lusalara classpredictionforhighdimensionalclassimbalanceddata
