Bias in random forest variable importance measures: Illustrations, sources and a solution

BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certa...

Bibliographic Details
Main Authors:	Strobl, Carolin; Boulesteix, Anne-Laure; Zeileis, Achim; Hothorn, Torsten
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1796903/ https://www.ncbi.nlm.nih.gov/pubmed/17254353 http://dx.doi.org/10.1186/1471-2105-8-25

author	Strobl, Carolin; Boulesteix, Anne-Laure; Zeileis, Achim; Hothorn, Torsten
author_facet	Strobl, Carolin; Boulesteix, Anne-Laure; Zeileis, Achim; Hothorn, Torsten
author_sort	Strobl, Carolin
collection	PubMed
description	BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. CONCLUSION: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
format	Text
id	pubmed-1796903
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1796903 2007-02-16 Bias in random forest variable importance measures: Illustrations, sources and a solution Strobl, Carolin; Boulesteix, Anne-Laure; Zeileis, Achim; Hothorn, Torsten BMC Bioinformatics Methodology Article BioMed Central 2007-01-25 /pmc/articles/PMC1796903/ /pubmed/17254353 http://dx.doi.org/10.1186/1471-2105-8-25 Text en Copyright © 2007 Strobl et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Bias in random forest variable importance measures: Illustrations, sources and a solution
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1796903/ https://www.ncbi.nlm.nih.gov/pubmed/17254353 http://dx.doi.org/10.1186/1471-2105-8-25
