
The parameter sensitivity of random forests

BACKGROUND: The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process: a critical step in model fitting. Due to n...


Bibliographic Details
Main Authors: Huang, Barbara F.F., Boutros, Paul C.
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009551/
https://www.ncbi.nlm.nih.gov/pubmed/27586051
http://dx.doi.org/10.1186/s12859-016-1228-x
_version_ 1782451534105149440
author Huang, Barbara F.F.
Boutros, Paul C.
author_facet Huang, Barbara F.F.
Boutros, Paul C.
author_sort Huang, Barbara F.F.
collection PubMed
description BACKGROUND: The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process: a critical step in model fitting. Due to numerous assertions regarding the performance reliability of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter sensitivity of RFs in computational genomic studies. We address this gap here. RESULTS: We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter optimization significantly improves upon the default parameters. CONCLUSIONS: Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by tuning RF models away from their default parameter settings. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1228-x) contains supplementary material, which is available to authorized users.
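The abstract's central recommendation, tuning RF parameters rather than relying on the defaults, can be illustrated with a minimal sketch. The snippet below is not the authors' code or data: it assumes scikit-learn's RandomForestClassifier, a synthetic high-p/n dataset, and an arbitrary small tuning grid, purely to show the default-versus-tuned comparison the abstract describes.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for a high p/n dataset: many more variables than samples.
X, y = make_classification(n_samples=100, n_features=1000, n_informative=20,
                           random_state=0)

# Baseline: the library's default parameters, scored by 5-fold cross-validation.
default_rf = RandomForestClassifier(random_state=0)
default_acc = cross_val_score(default_rf, X, y, cv=5, scoring="accuracy").mean()

# Small, arbitrary grid over the kinds of parameters the abstract highlights:
# number of trees, variables considered per split, and minimum leaf size.
grid = {
    "n_estimators": [100, 500],
    "max_features": ["sqrt", "log2", 0.1],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)

print(f"default 5-fold accuracy: {default_acc:.3f}")
print(f"tuned 5-fold accuracy:   {search.best_score_:.3f}")
print(f"best parameters:         {search.best_params_}")

On real data the grid, scoring metric, and cross-validation scheme would need to match the study design; the point of the sketch is only that tuned settings are compared against the defaults rather than assumed equivalent.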
format Online
Article
Text
id pubmed-5009551
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5009551 2016-09-08 The parameter sensitivity of random forests Huang, Barbara F.F. Boutros, Paul C. BMC Bioinformatics Methodology Article BACKGROUND: The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process: a critical step in model fitting. Due to numerous assertions regarding the performance reliability of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter sensitivity of RFs in computational genomic studies. We address this gap here. RESULTS: We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter optimization significantly improves upon the default parameters. CONCLUSIONS: Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by tuning RF models away from their default parameter settings. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1228-x) contains supplementary material, which is available to authorized users. BioMed Central 2016-09-01 /pmc/articles/PMC5009551/ /pubmed/27586051 http://dx.doi.org/10.1186/s12859-016-1228-x Text en © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology Article
Huang, Barbara F.F.
Boutros, Paul C.
The parameter sensitivity of random forests
title The parameter sensitivity of random forests
title_full The parameter sensitivity of random forests
title_fullStr The parameter sensitivity of random forests
title_full_unstemmed The parameter sensitivity of random forests
title_short The parameter sensitivity of random forests
title_sort parameter sensitivity of random forests
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009551/
https://www.ncbi.nlm.nih.gov/pubmed/27586051
http://dx.doi.org/10.1186/s12859-016-1228-x
work_keys_str_mv AT huangbarbaraff theparametersensitivityofrandomforests
AT boutrospaulc theparametersensitivityofrandomforests
AT huangbarbaraff parametersensitivityofrandomforests
AT boutrospaulc parametersensitivityofrandomforests