
Using recursive feature elimination in random forest to account for correlated variables in high dimensional data

BACKGROUND: Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Rando...

Bibliographic Details
Main Authors: Darst, Burcu F., Malecki, Kristen C., Engelman, Corinne D.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6157185/
https://www.ncbi.nlm.nih.gov/pubmed/30255764
http://dx.doi.org/10.1186/s12863-018-0633-8
_version_ 1783358228889862144
author Darst, Burcu F.
Malecki, Kristen C.
Engelman, Corinne D.
collection PubMed
description BACKGROUND: Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Random Forest-Recursive Feature Elimination algorithm (RF-RFE) mitigates this problem in smaller data sets, but this approach has not been tested in high-dimensional omics data sets. RESULTS: We integrated 202,919 genotypes and 153,422 methylation sites in 680 individuals, and compared the abilities of RF and RF-RFE to detect simulated causal associations, which included simulated genotype–methylation interactions, between these variables and triglyceride levels. Results show that RF was able to identify strong causal variables with a few highly correlated variables, but it did not detect other causal variables. CONCLUSIONS: Although RF-RFE decreased the importance of correlated variables, in the presence of many correlated variables, it also decreased the importance of causal variables, making both hard to detect. These findings suggest that RF-RFE may not scale to high-dimensional data.
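The RF-RFE procedure named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in scikit-learn's `RandomForestRegressor` and its impurity-based `feature_importances_`. The function name `rf_rfe` and the parameters `drop_fraction`, `min_features`, `n_estimators`, and `seed` are hypothetical, chosen only for illustration. The idea shown is the general one: fit a forest, drop the least important features, and refit on the remainder until few features are left.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_rfe(X, y, drop_fraction=0.2, min_features=1, n_estimators=100, seed=0):
    """Rank features by repeatedly refitting a forest and dropping the weakest.

    Returns feature indices ordered from most to least important, where
    features eliminated in later rounds rank above those eliminated earlier.
    All parameter names and defaults are illustrative assumptions.
    """
    remaining = np.arange(X.shape[1])  # column indices still in play
    eliminated = []                    # indices in the order they were dropped

    while remaining.size > min_features:
        forest = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        forest.fit(X[:, remaining], y)

        # Drop at least one feature per round, but never go below min_features.
        n_drop = max(1, int(drop_fraction * remaining.size))
        n_drop = min(n_drop, remaining.size - min_features)

        # Positions (within `remaining`) of the least important features.
        weakest = np.argsort(forest.feature_importances_)[:n_drop]
        eliminated.extend(remaining[weakest].tolist())
        remaining = np.delete(remaining, weakest)

    # Survivors first, then eliminated features from last dropped to first.
    return remaining.tolist() + eliminated[::-1]
```

Because importances are recomputed after every elimination round, a variable's rank can change once its correlated neighbours are removed. The description reports that, with many correlated variables, this same recomputation also lowered the importance of causal variables, so a sketch like this should be read as a way to explore the behaviour rather than as a recommended pipeline.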
format Online
Article
Text
id pubmed-6157185
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6157185 2018-10-01 Using recursive feature elimination in random forest to account for correlated variables in high dimensional data Darst, Burcu F. Malecki, Kristen C. Engelman, Corinne D. BMC Genet Research BioMed Central 2018-09-17 /pmc/articles/PMC6157185/ /pubmed/30255764 http://dx.doi.org/10.1186/s12863-018-0633-8 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Using recursive feature elimination in random forest to account for correlated variables in high dimensional data
topic Research