
Surrogate minimal depth as an importance measure for variables in random forests

MOTIVATION: It has been shown that the machine learning approach random forest can be successfully applied to omics data, such as gene expression data, for classification or regression and to select variables that are important for prediction. However, the complex relationships between predictor var...


Bibliographic Details
Main authors: Seifert, Stephan; Gundlach, Sven; Szymczak, Silke
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6761946/
https://www.ncbi.nlm.nih.gov/pubmed/30824905
http://dx.doi.org/10.1093/bioinformatics/btz149
_version_ 1783454129490755584
author Seifert, Stephan
Gundlach, Sven
Szymczak, Silke
author_facet Seifert, Stephan
Gundlach, Sven
Szymczak, Silke
author_sort Seifert, Stephan
collection PubMed
description MOTIVATION: It has been shown that the machine learning approach random forest can be successfully applied to omics data, such as gene expression data, for classification or regression and to select variables that are important for prediction. However, the complex relationships between predictor variables, in particular between causal predictor variables, make the interpretation of currently applied variable selection techniques difficult. RESULTS: Here we propose a new variable selection approach called surrogate minimal depth (SMD) that incorporates surrogate variables into the concept of minimal depth (MD) variable importance. Applying SMD, we show that simulated correlation patterns can be reconstructed and that the increased consideration of variable relationships improves variable selection. When compared with existing state-of-the-art methods and MD, SMD has higher empirical power to identify causal variables while the resulting variable lists are equally stable. In conclusion, SMD is a promising approach to get more insight into the complex interplay of predictor variables and outcome in a high-dimensional data setting. AVAILABILITY AND IMPLEMENTATION: https://github.com/StephanSeifert/SurrogateMinimalDepth. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6761946
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6761946 2019-10-02 Surrogate minimal depth as an importance measure for variables in random forests Seifert, Stephan Gundlach, Sven Szymczak, Silke Bioinformatics Original Papers Oxford University Press 2019-10-01 2019-03-01 /pmc/articles/PMC6761946/ /pubmed/30824905 http://dx.doi.org/10.1093/bioinformatics/btz149 Text en © The Author(s) 2019. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Surrogate minimal depth as an importance measure for variables in random forests
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6761946/
https://www.ncbi.nlm.nih.gov/pubmed/30824905
http://dx.doi.org/10.1093/bioinformatics/btz149