Splitting on categorical predictors in random forests
One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. (Full abstract in the description field below.)
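The complexity reduction described in the abstract is easy to see in code. For binary classification and regression, sorting the k categories by their mean response and scanning only the k − 1 resulting ordered splits finds the same best split as testing all 2^(k − 1) − 1 2-partitions. The sketch below is a minimal Python illustration of that idea, not the authors' implementation (the paper's methods are implemented in the ranger R package); the function name and the variance-reduction criterion are choices made for this sketch. A second sketch of the multiclass ordering heuristic follows the full record at the bottom of the page.

```python
import numpy as np

def best_split_ordered(x, y):
    """Best binary split of a nominal predictor x on a numeric (or 0/1)
    response y: order the categories by mean response, then scan only
    the k - 1 ordered splits. For regression and binary classification
    this finds the same split as checking all 2^(k-1) - 1 2-partitions,
    at a fraction of the cost."""
    cats = [str(c) for c in np.unique(x)]
    # The core trick: order the categories by their mean response.
    order = sorted(cats, key=lambda c: y[x == c].mean())
    best_sse, best_left = np.inf, None
    for i in range(1, len(order)):            # only k - 1 candidate splits
        left = np.isin(x, order[:i])
        # Split criterion: total within-group sum of squares.
        sse = (((y[left] - y[left].mean()) ** 2).sum()
               + ((y[~left] - y[~left].mean()) ** 2).sum())
        if sse < best_sse:
            best_sse, best_left = sse, set(order[:i])
    return best_left

# Usage: 5 categories -> 4 candidate splits instead of 2^4 - 1 = 15.
rng = np.random.default_rng(0)
x = rng.choice(list("abcde"), size=200)
y = np.where(np.isin(x, ["b", "d"]), 1.0, 0.0) + rng.normal(0, 0.3, size=200)
print(best_split_ordered(x, y))  # recovers {'b', 'd'} or its complement
```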
Main Authors: | Wright, Marvin N.; König, Inke R. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | PeerJ Inc., 2019 |
Subjects: | Statistics |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6368971/ https://www.ncbi.nlm.nih.gov/pubmed/30746306 http://dx.doi.org/10.7717/peerj.6339 |
_version_ | 1783394091617222656 |
---|---|
author | Wright, Marvin N.; König, Inke R. |
author_facet | Wright, Marvin N.; König, Inke R. |
author_sort | Wright, Marvin N. |
collection | PubMed |
description | One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k − 1) − 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding, and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs. |
format | Online Article Text |
id | pubmed-6368971 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | PeerJ Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-6368971 2019-02-11 Splitting on categorical predictors in random forests Wright, Marvin N. König, Inke R. PeerJ Statistics [abstract as in the description field above] PeerJ Inc. 2019-02-07 /pmc/articles/PMC6368971/ /pubmed/30746306 http://dx.doi.org/10.7717/peerj.6339 Text en © 2019 Wright et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited. |
spellingShingle | Statistics Wright, Marvin N. König, Inke R. Splitting on categorical predictors in random forests |
title | Splitting on categorical predictors in random forests |
title_full | Splitting on categorical predictors in random forests |
title_fullStr | Splitting on categorical predictors in random forests |
title_full_unstemmed | Splitting on categorical predictors in random forests |
title_short | Splitting on categorical predictors in random forests |
title_sort | splitting on categorical predictors in random forests |
topic | Statistics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6368971/ https://www.ncbi.nlm.nih.gov/pubmed/30746306 http://dx.doi.org/10.7717/peerj.6339 |
work_keys_str_mv | AT wrightmarvinn splittingoncategoricalpredictorsinrandomforests AT koniginker splittingoncategoricalpredictorsinrandomforests |
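For multiclass classification, the description field above proposes ordering the categories by their coordinates on the first principal component of the weighted covariance matrix of the per-category class-probability vectors (for survival prediction, log-rank scores play the same role). The sketch below illustrates that heuristic under plain assumptions (category counts as weights, a standard eigendecomposition); the exact weighting and tie handling in the authors' implementation may differ.

```python
import numpy as np

def order_categories_multiclass(x, y):
    """Order the categories of a nominal predictor x for a multiclass
    response y by their score on the first principal component of the
    weighted covariance matrix of the per-category class-probability
    vectors. Once ordered, the predictor can be treated as ordinal in
    the whole forest, so only k - 1 splits are ever evaluated."""
    cats = [str(c) for c in np.unique(x)]
    classes = np.unique(y)
    # Per-category class-probability vectors (rows) and size weights.
    P = np.array([[np.mean(y[x == c] == k) for k in classes] for c in cats])
    w = np.array([np.sum(x == c) for c in cats], dtype=float)
    w /= w.sum()
    # Weighted mean and weighted covariance of the probability vectors.
    mu = w @ P
    D = P - mu
    cov = (D * w[:, None]).T @ D
    # First principal component = eigenvector of the largest eigenvalue
    # (np.linalg.eigh returns eigenvalues in ascending order).
    pc1 = np.linalg.eigh(cov)[1][:, -1]
    scores = P @ pc1  # project each category's probability vector onto PC1
    return [cats[i] for i in np.argsort(scores)]

# Usage: categories {a, d}, {b, e} and {c} each favor a different class,
# so the returned ordering places a next to d and b next to e.
x = np.array(list("aabbccddee" * 20))
y = np.array([0, 0, 1, 1, 2, 2, 0, 0, 1, 1] * 20)
print(order_categories_multiclass(x, y))
```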