
A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets

BACKGROUND: Gene selection is an important step when building predictors of disease state based on gene expression data. Gene selection generally improves performance and identifies a relevant subset of genes. Many univariate and multivariate gene selection approaches have been proposed. Frequently...


Bibliographic Details
Main Authors: Lai, Carmen, Reinders, Marcel JT, van't Veer, Laura J, Wessels, Lodewyk FA
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1569875/
https://www.ncbi.nlm.nih.gov/pubmed/16670007
http://dx.doi.org/10.1186/1471-2105-7-235
_version_ 1782130227195936768
author Lai, Carmen
Reinders, Marcel JT
van't Veer, Laura J
Wessels, Lodewyk FA
author_facet Lai, Carmen
Reinders, Marcel JT
van't Veer, Laura J
Wessels, Lodewyk FA
author_sort Lai, Carmen
collection PubMed
description BACKGROUND: Gene selection is an important step when building predictors of disease state based on gene expression data. Gene selection generally improves performance and identifies a relevant subset of genes. Many univariate and multivariate gene selection approaches have been proposed. Frequently the claim is made that genes are co-regulated (due to pathway dependencies) and that multivariate approaches are therefore per definition more desirable than univariate selection approaches. Based on the published performances of all these approaches a fair comparison of the available results can not be made. This mainly stems from two factors. First, the results are often biased, since the validation set is in one way or another involved in training the predictor, resulting in optimistically biased performance estimates. Second, the published results are often based on a small number of relatively simple datasets. Consequently no generally applicable conclusions can be drawn. RESULTS: In this study we adopted an unbiased protocol to perform a fair comparison of frequently used multivariate and univariate gene selection techniques, in combination with a range of classifiers. Our conclusions are based on seven gene expression datasets, across several cancer types. CONCLUSION: Our experiments illustrate that, contrary to several previous studies, in five of the seven datasets univariate selection approaches yield consistently better results than multivariate approaches. The simplest multivariate selection approach, the Top Scoring method, achieves the best results on the remaining two datasets. We conclude that the correlation structures, if present, are difficult to extract due to the small number of samples, and that consequently, overly-complex gene selection algorithms that attempt to extract these structures are prone to overtraining.
format Text
id pubmed-1569875
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1569875 2006-09-26 A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets Lai, Carmen Reinders, Marcel JT van't Veer, Laura J Wessels, Lodewyk FA BMC Bioinformatics Research Article BACKGROUND: Gene selection is an important step when building predictors of disease state based on gene expression data. Gene selection generally improves performance and identifies a relevant subset of genes. Many univariate and multivariate gene selection approaches have been proposed. Frequently the claim is made that genes are co-regulated (due to pathway dependencies) and that multivariate approaches are therefore per definition more desirable than univariate selection approaches. Based on the published performances of all these approaches a fair comparison of the available results can not be made. This mainly stems from two factors. First, the results are often biased, since the validation set is in one way or another involved in training the predictor, resulting in optimistically biased performance estimates. Second, the published results are often based on a small number of relatively simple datasets. Consequently no generally applicable conclusions can be drawn. RESULTS: In this study we adopted an unbiased protocol to perform a fair comparison of frequently used multivariate and univariate gene selection techniques, in combination with a range of classifiers. Our conclusions are based on seven gene expression datasets, across several cancer types. CONCLUSION: Our experiments illustrate that, contrary to several previous studies, in five of the seven datasets univariate selection approaches yield consistently better results than multivariate approaches. The simplest multivariate selection approach, the Top Scoring method, achieves the best results on the remaining two datasets. We conclude that the correlation structures, if present, are difficult to extract due to the small number of samples, and that consequently, overly-complex gene selection algorithms that attempt to extract these structures are prone to overtraining. BioMed Central 2006-05-02 /pmc/articles/PMC1569875/ /pubmed/16670007 http://dx.doi.org/10.1186/1471-2105-7-235 Text en Copyright © 2006 Lai et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Lai, Carmen
Reinders, Marcel JT
van't Veer, Laura J
Wessels, Lodewyk FA
A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets
title A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets
title_full A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets
title_fullStr A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets
title_full_unstemmed A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets
title_short A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets
title_sort comparison of univariate and multivariate gene selection techniques for classification of cancer datasets
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1569875/
https://www.ncbi.nlm.nih.gov/pubmed/16670007
http://dx.doi.org/10.1186/1471-2105-7-235
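The abstract's first source of bias (the validation set being involved in training the predictor) can be illustrated with a small sketch. The code below is not the authors' protocol: it assumes a two-class problem, uses an absolute t-statistic as a stand-in univariate criterion and a nearest-mean classifier as a stand-in predictor, and all function names (`t_score`, `select_genes`, `nearest_mean_predict`, `cv_error`) are hypothetical. The only point it makes is where gene selection happens relative to the cross-validation split.

```python
import random
import statistics


def t_score(values_a, values_b):
    """Absolute t-statistic (unequal variances) for one gene across two classes."""
    mean_a, mean_b = statistics.fmean(values_a), statistics.fmean(values_b)
    var_a, var_b = statistics.variance(values_a), statistics.variance(values_b)
    denom = (var_a / len(values_a) + var_b / len(values_b)) ** 0.5
    return abs(mean_a - mean_b) / denom if denom > 0 else 0.0


def select_genes(X, y, k):
    """Rank genes one at a time on the given samples only; return top-k indices."""
    a = [row for row, label in zip(X, y) if label == 0]
    b = [row for row, label in zip(X, y) if label == 1]
    scores = [t_score([r[g] for r in a], [r[g] for r in b])
              for g in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]


def nearest_mean_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid over the selected genes is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [statistics.fmean(r[g] for r in rows) for g in genes]
        dist = sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(X, y, k_genes, n_folds=5, select_inside=True, seed=0):
    """Cross-validated error rate.

    select_inside=True  -> genes are chosen from the training fold only.
    select_inside=False -> genes are chosen once on ALL samples, so the
                           held-out samples leak into training (biased).
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    leaked = None if select_inside else select_genes(X, y, k_genes)
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        genes = select_genes(X_tr, y_tr, k_genes) if select_inside else leaked
        errors += sum(nearest_mean_predict(X_tr, y_tr, X[i], genes) != y[i]
                      for i in fold)
    return errors / len(X)


if __name__ == "__main__":
    # Synthetic pure-noise data (an assumption for illustration): no gene
    # carries class information, so an honest error estimate should sit near 0.5.
    rng = random.Random(1)
    X = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(40)]
    y = [i % 2 for i in range(40)]
    print("selection inside CV :", cv_error(X, y, 10, select_inside=True))
    print("selection outside CV:", cv_error(X, y, 10, select_inside=False))
```

On noise-only data such as the synthetic example in the `__main__` block, selecting genes on all samples before cross-validation typically reports a noticeably lower error than selecting inside each training fold, even though no gene is informative. That gap is the optimistic bias the abstract refers to.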