The revival of the Gini importance?

MOTIVATION: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms are variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based...

Bibliographic Details
Main authors:	Nembrini, Stefano; König, Inke R; Wright, Marvin N
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Original Papers
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6198850/ https://www.ncbi.nlm.nih.gov/pubmed/29757357 http://dx.doi.org/10.1093/bioinformatics/bty373

_version_	1783365029145346048
author	Nembrini, Stefano König, Inke R Wright, Marvin N
author_facet	Nembrini, Stefano König, Inke R Wright, Marvin N
author_sort	Nembrini, Stefano
collection	PubMed
description	MOTIVATION: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms are variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. RESULTS: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. AVAILABILITY AND IMPLEMENTATION: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6198850
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-61988502018-10-26 The revival of the Gini importance? Nembrini, Stefano König, Inke R Wright, Marvin N Bioinformatics Original Papers MOTIVATION: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms are variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. RESULTS: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. AVAILABILITY AND IMPLEMENTATION: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2018-11-01 2018-05-10 /pmc/articles/PMC6198850/ /pubmed/29757357 http://dx.doi.org/10.1093/bioinformatics/bty373 Text en © The Author(s) 2018. Published by Oxford University Press. 
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Nembrini, Stefano König, Inke R Wright, Marvin N The revival of the Gini importance?
title	The revival of the Gini importance?
title_full	The revival of the Gini importance?
title_fullStr	The revival of the Gini importance?
title_full_unstemmed	The revival of the Gini importance?
title_short	The revival of the Gini importance?
title_sort	revival of the gini importance?
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6198850/ https://www.ncbi.nlm.nih.gov/pubmed/29757357 http://dx.doi.org/10.1093/bioinformatics/bty373
work_keys_str_mv	AT nembrinistefano therevivaloftheginiimportance AT koniginker therevivaloftheginiimportance AT wrightmarvinn therevivaloftheginiimportance AT nembrinistefano revivaloftheginiimportance AT koniginker revivaloftheginiimportance AT wrightmarvinn revivaloftheginiimportance

Similar items