
Effect of training-sample size and classification difficulty on the accuracy of genomic predictors


Bibliographic Details
Main authors: Popovici, Vlad; Chen, Weijie; Gallas, Brandon G; Hatzis, Christos; Shi, Weiwei; Samuelson, Frank W; Nikolsky, Yuri; Tsyganova, Marina; Ishkin, Alex; Nikolskaya, Tatiana; Hess, Kenneth R; Valero, Vicente; Booser, Daniel; Delorenzi, Mauro; Hortobagyi, Gabriel N; Shi, Leming; Symmans, W Fraser; Pusztai, Lajos
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880423/
https://www.ncbi.nlm.nih.gov/pubmed/20064235
http://dx.doi.org/10.1186/bcr2468
author Popovici, Vlad
Chen, Weijie
Gallas, Brandon G
Hatzis, Christos
Shi, Weiwei
Samuelson, Frank W
Nikolsky, Yuri
Tsyganova, Marina
Ishkin, Alex
Nikolskaya, Tatiana
Hess, Kenneth R
Valero, Vicente
Booser, Daniel
Delorenzi, Mauro
Hortobagyi, Gabriel N
Shi, Leming
Symmans, W Fraser
Pusztai, Lajos
collection PubMed
description INTRODUCTION: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints.
METHODS: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set.
RESULTS: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models.
CONCLUSIONS: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
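The study design summarized in the description can be illustrated with a minimal Python sketch. It enumerates the 5 x 8 x 3 grid that yields the 120 models and generates train/test index splits for the two resampling families named in the results (cross-validation and bootstrapping). The record does not name the individual methods, the number of folds, the number of bootstrap replicates, or the training-set size, so every label and parameter below (`fs_1`, `clf_1`, `endpoint_1`, `k`, `b`, `n`) is a hypothetical placeholder, and the out-of-bag test set is one common bootstrap convention rather than necessarily the authors' procedure.

```python
import random
from itertools import product

# Hypothetical placeholder labels: the record gives only the counts.
FEATURE_SELECTORS = [f"fs_{i}" for i in range(1, 6)]   # 5 univariate methods
CLASSIFIERS = [f"clf_{i}" for i in range(1, 9)]        # 8 classifiers
ENDPOINTS = [f"endpoint_{i}" for i in range(1, 4)]     # 3 clinical endpoints


def model_grid():
    """Return every (endpoint, feature selector, classifier) combination."""
    return list(product(ENDPOINTS, FEATURE_SELECTORS, CLASSIFIERS))


def kfold_splits(n, k, rng):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each sample lands in exactly one test fold.
    """
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def bootstrap_splits(n, b, rng):
    """Yield (train, test) index lists for b bootstrap replicates.

    Training indices are drawn with replacement; the test set is the
    out-of-bag samples (an assumed convention, not taken from the record).
    """
    for _ in range(b):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        test = [j for j in range(n) if j not in in_bag]
        yield train, test
```

In use, each of the 120 grid entries would be fitted on the `train` indices and scored on the `test` indices of every split, and the averaged resampling score would then be compared with the score on the independent validation set.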
format Text
id pubmed-2880423
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2880423 2010-06-04 Breast Cancer Res Research article BioMed Central 2010 2010-01-11 /pmc/articles/PMC2880423/ /pubmed/20064235 http://dx.doi.org/10.1186/bcr2468 Text en Copyright ©2010 Popovici et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Effect of training-sample size and classification difficulty on the accuracy of genomic predictors
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880423/
https://www.ncbi.nlm.nih.gov/pubmed/20064235
http://dx.doi.org/10.1186/bcr2468