Optimally splitting cases for training and testing high dimensional classifiers

Bibliographic Details
Main Authors: Dobbin, Kevin K, Simon, Richard M
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3090739/
https://www.ncbi.nlm.nih.gov/pubmed/21477282
http://dx.doi.org/10.1186/1755-8794-4-31
_version_ 1782203171186147328
author Dobbin, Kevin K
Simon, Richard M
author_facet Dobbin, Kevin K
Simon, Richard M
author_sort Dobbin, Kevin K
collection PubMed
description BACKGROUND: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion affects the mean squared error (MSE) of the prediction accuracy estimate. RESULTS: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. CONCLUSIONS: By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in a larger proportion of cases assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
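The description refers to two technical devices without spelling them out: a three-part decomposition of the MSE of the accuracy estimate, and a nonparametric resampling procedure for choosing the split proportion. The sketches below are illustrative readings only, not the authors' exact formulation. Writing A_n for the accuracy of the classifier trained on all n cases, A_t for the (random) accuracy of the classifier trained on a training subset, and \hat{A} for its estimate on the held-out test set, and assuming the test-set estimate is conditionally unbiased given the trained classifier, one natural three-part decomposition is

\[
\mathrm{MSE} = E\big[(\hat{A} - A_n)^2\big]
= \underbrace{E\big[(\hat{A} - A_t)^2\big]}_{\text{test-set estimation variance}}
+ \underbrace{\mathrm{Var}(A_t)}_{\text{training-sample variability}}
+ \underbrace{\big(E[A_t] - A_n\big)^2}_{\text{squared bias from training on fewer than } n \text{ cases}}
\]

A larger training proportion shrinks the last two terms but inflates the first, which is why an optimal split exists. In the same spirit, the following Python sketch resamples train/test splits at several candidate proportions and picks the one with the smallest estimated MSE relative to a full-data reference accuracy. The classifier, the cross-validated reference accuracy, the grid of proportions, and the repetition count are all illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of a resampling-based choice of the training proportion.
# Assumptions: a linear classifier (logistic regression), cross-validation as a
# stand-in for the full-dataset reference accuracy, and an arbitrary grid of
# candidate proportions. None of these choices are taken from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

def choose_split_proportion(X, y, proportions=(0.4, 0.5, 0.6, 2/3, 0.8),
                            n_rep=200, seed=0):
    rng = np.random.RandomState(seed)
    clf = LogisticRegression(max_iter=1000)
    # Reference: accuracy attainable when (nearly) all n cases are used for training.
    ref_acc = cross_val_score(clf, X, y, cv=5).mean()

    mse = {}
    for p in proportions:
        sq_errs = []
        for _ in range(n_rep):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=p, stratify=y,
                random_state=rng.randint(2**31 - 1))
            acc_hat = clf.fit(X_tr, y_tr).score(X_te, y_te)
            sq_errs.append((acc_hat - ref_acc) ** 2)
        mse[p] = float(np.mean(sq_errs))
    best = min(mse, key=mse.get)  # proportion with the smallest estimated MSE
    return best, mse
```

Consistent with the conclusions quoted above, such a procedure tends to favour larger training proportions when n is small or the achievable accuracy is high, and proportions near 2/3 when n is large and the signal is strong.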
format Text
id pubmed-3090739
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3090739 2011-05-11 Optimally splitting cases for training and testing high dimensional classifiers Dobbin, Kevin K Simon, Richard M BMC Med Genomics Research Article BACKGROUND: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion affects the mean squared error (MSE) of the prediction accuracy estimate. RESULTS: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. CONCLUSIONS: By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in a larger proportion of cases assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split. BioMed Central 2011-04-08 /pmc/articles/PMC3090739/ /pubmed/21477282 http://dx.doi.org/10.1186/1755-8794-4-31 Text en Copyright ©2011 Dobbin and Simon; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Dobbin, Kevin K
Simon, Richard M
Optimally splitting cases for training and testing high dimensional classifiers
title Optimally splitting cases for training and testing high dimensional classifiers
title_full Optimally splitting cases for training and testing high dimensional classifiers
title_fullStr Optimally splitting cases for training and testing high dimensional classifiers
title_full_unstemmed Optimally splitting cases for training and testing high dimensional classifiers
title_short Optimally splitting cases for training and testing high dimensional classifiers
title_sort optimally splitting cases for training and testing high dimensional classifiers
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3090739/
https://www.ncbi.nlm.nih.gov/pubmed/21477282
http://dx.doi.org/10.1186/1755-8794-4-31
work_keys_str_mv AT dobbinkevink optimallysplittingcasesfortrainingandtestinghighdimensionalclassifiers
AT simonrichardm optimallysplittingcasesfortrainingandtestinghighdimensionalclassifiers