Effect of training-sample size and classification difficulty on the accuracy of genomic predictors
INTRODUCTION: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by thr...
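Main authors: | Popovici, Vlad; Chen, Weijie; Gallas, Brandon G; Hatzis, Christos; Shi, Weiwei; Samuelson, Frank W; Nikolsky, Yuri; Tsyganova, Marina; Ishkin, Alex; Nikolskaya, Tatiana; Hess, Kenneth R; Valero, Vicente; Booser, Daniel; Delorenzi, Mauro; Hortobagyi, Gabriel N; Shi, Leming; Symmans, W Fraser; Pusztai, Lajos |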
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880423/ https://www.ncbi.nlm.nih.gov/pubmed/20064235 http://dx.doi.org/10.1186/bcr2468 |
_version_ | 1782182027741626368 |
---|---|
author | Popovici, Vlad Chen, Weijie Gallas, Brandon G Hatzis, Christos Shi, Weiwei Samuelson, Frank W Nikolsky, Yuri Tsyganova, Marina Ishkin, Alex Nikolskaya, Tatiana Hess, Kenneth R Valero, Vicente Booser, Daniel Delorenzi, Mauro Hortobagyi, Gabriel N Shi, Leming Symmans, W Fraser Pusztai, Lajos |
author_facet | Popovici, Vlad Chen, Weijie Gallas, Brandon G Hatzis, Christos Shi, Weiwei Samuelson, Frank W Nikolsky, Yuri Tsyganova, Marina Ishkin, Alex Nikolskaya, Tatiana Hess, Kenneth R Valero, Vicente Booser, Daniel Delorenzi, Mauro Hortobagyi, Gabriel N Shi, Leming Symmans, W Fraser Pusztai, Lajos |
author_sort | Popovici, Vlad |
collection | PubMed |
description | INTRODUCTION: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. METHODS: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. RESULTS: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. CONCLUSIONS: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem. |
format | Text |
id | pubmed-2880423 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2880423 2010-06-04 Effect of training-sample size and classification difficulty on the accuracy of genomic predictors Popovici, Vlad Chen, Weijie Gallas, Brandon G Hatzis, Christos Shi, Weiwei Samuelson, Frank W Nikolsky, Yuri Tsyganova, Marina Ishkin, Alex Nikolskaya, Tatiana Hess, Kenneth R Valero, Vicente Booser, Daniel Delorenzi, Mauro Hortobagyi, Gabriel N Shi, Leming Symmans, W Fraser Pusztai, Lajos Breast Cancer Res Research article INTRODUCTION: As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. METHODS: We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. RESULTS: A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. CONCLUSIONS: We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem. BioMed Central 2010 2010-01-11 /pmc/articles/PMC2880423/ /pubmed/20064235 http://dx.doi.org/10.1186/bcr2468 Text en Copyright ©2010 Popovici et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research article Popovici, Vlad Chen, Weijie Gallas, Brandon G Hatzis, Christos Shi, Weiwei Samuelson, Frank W Nikolsky, Yuri Tsyganova, Marina Ishkin, Alex Nikolskaya, Tatiana Hess, Kenneth R Valero, Vicente Booser, Daniel Delorenzi, Mauro Hortobagyi, Gabriel N Shi, Leming Symmans, W Fraser Pusztai, Lajos Effect of training-sample size and classification difficulty on the accuracy of genomic predictors |
title | Effect of training-sample size and classification difficulty on the accuracy of genomic predictors |
title_full | Effect of training-sample size and classification difficulty on the accuracy of genomic predictors |
title_fullStr | Effect of training-sample size and classification difficulty on the accuracy of genomic predictors |
title_full_unstemmed | Effect of training-sample size and classification difficulty on the accuracy of genomic predictors |
title_short | Effect of training-sample size and classification difficulty on the accuracy of genomic predictors |
title_sort | effect of training-sample size and classification difficulty on the accuracy of genomic predictors |
topic | Research article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880423/ https://www.ncbi.nlm.nih.gov/pubmed/20064235 http://dx.doi.org/10.1186/bcr2468 |