Gene selection and classification of microarray data using random forest

BACKGROUND: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinic...

Bibliographic Details
Main Authors: Díaz-Uriarte, Ramón; Alvarez de Andrés, Sara
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1363357/
https://www.ncbi.nlm.nih.gov/pubmed/16398926
http://dx.doi.org/10.1186/1471-2105-7-3
_version_ 1782126734158594048
author Díaz-Uriarte, Ramón
Alvarez de Andrés, Sara
author_facet Díaz-Uriarte, Ramón
Alvarez de Andrés, Sara
author_sort Díaz-Uriarte, Ramón
collection PubMed
description BACKGROUND: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. RESULTS: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. CONCLUSION: Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
format Text
id pubmed-1363357
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1363357 2006-02-10 Gene selection and classification of microarray data using random forest Díaz-Uriarte, Ramón Alvarez de Andrés, Sara BMC Bioinformatics Methodology Article BACKGROUND: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. RESULTS: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
CONCLUSION: Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data. BioMed Central 2006-01-06 /pmc/articles/PMC1363357/ /pubmed/16398926 http://dx.doi.org/10.1186/1471-2105-7-3 Text en Copyright © 2006 Díaz-Uriarte and Alvarez de Andrés; licensee BioMed Central Ltd.
spellingShingle Methodology Article
Díaz-Uriarte, Ramón
Alvarez de Andrés, Sara
Gene selection and classification of microarray data using random forest
title Gene selection and classification of microarray data using random forest
title_full Gene selection and classification of microarray data using random forest
title_fullStr Gene selection and classification of microarray data using random forest
title_full_unstemmed Gene selection and classification of microarray data using random forest
title_short Gene selection and classification of microarray data using random forest
title_sort gene selection and classification of microarray data using random forest
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1363357/
https://www.ncbi.nlm.nih.gov/pubmed/16398926
http://dx.doi.org/10.1186/1471-2105-7-3
work_keys_str_mv AT diazuriarteramon geneselectionandclassificationofmicroarraydatausingrandomforest
AT alvarezdeandressara geneselectionandclassificationofmicroarraydatausingrandomforest