Boosting for high-dimensional two-class prediction

BACKGROUND: In clinical research prediction models are used to accurately predict the outcome of the patients based on some of their characteristics. For high-dimensional prediction models (the number of variables greatly exceeds the number of samples) the choice of an appropriate classifier is cruc...

Bibliographic Details
Main Authors:	Blagus, Rok; Lusa, Lara
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4578758/ https://www.ncbi.nlm.nih.gov/pubmed/26390865 http://dx.doi.org/10.1186/s12859-015-0723-9

_version_	1782391160874991616
author	Blagus, Rok Lusa, Lara
author_facet	Blagus, Rok Lusa, Lara
author_sort	Blagus, Rok
collection	PubMed
description	BACKGROUND: In clinical research, prediction models are used to accurately predict the outcome of patients based on some of their characteristics. For high-dimensional prediction models (where the number of variables greatly exceeds the number of samples), the choice of an appropriate classifier is crucial, as it has been observed that no single classification algorithm performs optimally for all types of data. Boosting was proposed as a method that combines the classification results obtained using base classifiers, where the sample weights are sequentially adjusted based on the performance in previous iterations. Generally, boosting outperforms any individual classifier, but studies with high-dimensional data showed that the most standard boosting algorithm, AdaBoost.M1, cannot significantly improve the performance of its base classifier. More recently, other boosting algorithms were proposed (Gradient boosting, Stochastic Gradient boosting, LogitBoost); they were shown to perform better than AdaBoost.M1, but their performance was not evaluated for high-dimensional data. RESULTS: In this paper we use simulation studies and real gene-expression data sets to evaluate the performance of boosting algorithms when data are high-dimensional. Our results confirm that AdaBoost.M1 can perform poorly in this setting, often failing to improve the performance of its base classifier. We provide an explanation for this and propose a modification, AdaBoost.M1.ICV, which uses cross-validated estimates of the prediction errors and outperforms the original algorithm when data are high-dimensional. The use of AdaBoost.M1.ICV is advisable when the base classifier overfits the training data: the number of variables is large, the number of samples is small, and/or the difference between the classes is large. To a lesser extent, Gradient boosting also suffers from similar problems. Contrary to the findings for low-dimensional data, shrinkage does not improve the performance of Gradient boosting when data are high-dimensional; however, it is beneficial for Stochastic Gradient boosting, which outperformed the other boosting algorithms in our analyses. LogitBoost suffers from overfitting and generally performs poorly. CONCLUSIONS: The results show that boosting can substantially improve the performance of its base classifier even when data are high-dimensional. However, not all boosting algorithms perform equally well. LogitBoost, AdaBoost.M1 and Gradient boosting seem less useful for this type of data. Overall, Stochastic Gradient boosting with shrinkage and AdaBoost.M1.ICV seem to be the preferable choices for high-dimensional class prediction. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0723-9) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4578758
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4578758 2015-09-23 Boosting for high-dimensional two-class prediction Blagus, Rok Lusa, Lara BMC Bioinformatics Research Article BioMed Central 2015-09-21 /pmc/articles/PMC4578758/ /pubmed/26390865 http://dx.doi.org/10.1186/s12859-015-0723-9 Text en © Blagus and Lusa. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Blagus, Rok Lusa, Lara Boosting for high-dimensional two-class prediction
title	Boosting for high-dimensional two-class prediction
title_full	Boosting for high-dimensional two-class prediction
title_fullStr	Boosting for high-dimensional two-class prediction
title_full_unstemmed	Boosting for high-dimensional two-class prediction
title_short	Boosting for high-dimensional two-class prediction
title_sort	boosting for high-dimensional two-class prediction
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4578758/ https://www.ncbi.nlm.nih.gov/pubmed/26390865 http://dx.doi.org/10.1186/s12859-015-0723-9
work_keys_str_mv	AT blagusrok boostingforhighdimensionaltwoclassprediction AT lusalara boostingforhighdimensionaltwoclassprediction
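
The description field characterizes boosting as a method in which sample weights are sequentially adjusted based on performance in previous iterations. A minimal sketch of that reweighting loop, in the textbook AdaBoost.M1 form with decision stumps as the base classifier, might look as follows. This is not the authors' code: the stump learner, the function names, and the toy data are assumptions chosen for illustration, and the AdaBoost.M1.ICV modification is not implemented because this record does not give its details.

```python
import math


def fit_stump(X, y, w):
    """Return the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for s in (1, -1):
                # Stump rule: predict s when row[j] > t, otherwise -s.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (s if row[j] > t else -s) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best[1:]


def stump_predict(stump, row):
    j, t, s = stump
    return s if row[j] > t else -s


def adaboost_m1(X, y, rounds=10):
    """Textbook AdaBoost.M1 for labels in {-1, +1}; returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n  # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        miss = [stump_predict(stump, row) != yi for row, yi in zip(X, y)]
        # Weighted error measured on the training samples themselves.
        err = sum(wi for wi, m in zip(w, miss) if m)
        if err >= 0.5:
            break  # base classifier is no better than chance
        if err == 0:
            # Perfect training fit: alpha would be infinite, so keep this
            # stump with an arbitrary weight and stop (a choice of this sketch).
            ensemble.append((1.0, stump))
            break
        alpha = math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight misclassified samples, then renormalize to sum to 1.
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Toy one-feature data that no single stump can separate.
    X = [[0], [1], [2], [3], [4], [5]]
    y = [1, 1, -1, -1, 1, 1]
    model = adaboost_m1(X, y, rounds=3)
    print([predict(model, row) for row in X])
```

In this sketch `err` is computed on the same samples used to fit the stump. According to the abstract, AdaBoost.M1.ICV instead uses cross-validated estimates of the prediction errors, which is the setting it recommends when the base classifier overfits the training data.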
