Improving random forest predictions in small datasets from two-phase sampling designs
BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases—a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.
Main Authors: Han, Sunwoo; Williamson, Brian D.; Fong, Youyi
Format: Online Article Text
Language: English
Published: BioMed Central, 2021
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8607560/ https://www.ncbi.nlm.nih.gov/pubmed/34809631 http://dx.doi.org/10.1186/s12911-021-01688-3
_version_ | 1784602584282038272 |
author | Han, Sunwoo Williamson, Brian D. Fong, Youyi |
author_facet | Han, Sunwoo Williamson, Brian D. Fong, Youyi |
author_sort | Han, Sunwoo |
collection | PubMed |
description | BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases—a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. RESULTS: Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of the weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers, and prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. CONCLUSION: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01688-3. |
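The description above names two of the paper's key ideas: weighting observations by the inverse of their sampling probability under a two-phase design, and stacking a random forest with a generalized linear model. The following is a minimal sketch of both using scikit-learn on simulated data (the trial dataset is not reproduced here; the sampling probabilities and all variable names are hypothetical, and this is an illustration of the general techniques, not the authors' exact pipeline).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for a small two-phase dataset: a rare outcome
# (~10% cases) and a moderate number of marker covariates.
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Inverse sampling probability weighting: assume cases (y == 1) were
# sampled with probability 1 and controls were subsampled with
# probability 0.25 (hypothetical values for illustration).
p_sampled = np.where(y == 1, 1.0, 0.25)
sample_weight = 1.0 / p_sampled

# Weighted random forest.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y, sample_weight=sample_weight)

# Stacking a random forest with a simple GLM (logistic regression);
# the meta-learner combines out-of-fold predictions from both.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
        ("glm", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
probs = stack.predict_proba(X)  # class probabilities from the ensemble
```

Per the abstract, the benefit of stacking depends on how dissimilar the candidate learners' predictions are, so in practice one would compare the correlation of the RF and GLM out-of-fold predictions before expecting a gain.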
format | Online Article Text |
id | pubmed-8607560 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8607560 2021-11-22 Improving random forest predictions in small datasets from two-phase sampling designs Han, Sunwoo Williamson, Brian D. Fong, Youyi BMC Med Inform Decis Mak Research Article BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases—a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. RESULTS: Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of the weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers, and prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. CONCLUSION: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01688-3. BioMed Central 2021-11-22 /pmc/articles/PMC8607560/ /pubmed/34809631 http://dx.doi.org/10.1186/s12911-021-01688-3 Text en © The Author(s) 2021. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Research Article Han, Sunwoo Williamson, Brian D. Fong, Youyi Improving random forest predictions in small datasets from two-phase sampling designs |
title | Improving random forest predictions in small datasets from two-phase sampling designs |
title_full | Improving random forest predictions in small datasets from two-phase sampling designs |
title_fullStr | Improving random forest predictions in small datasets from two-phase sampling designs |
title_full_unstemmed | Improving random forest predictions in small datasets from two-phase sampling designs |
title_short | Improving random forest predictions in small datasets from two-phase sampling designs |
title_sort | improving random forest predictions in small datasets from two-phase sampling designs |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8607560/ https://www.ncbi.nlm.nih.gov/pubmed/34809631 http://dx.doi.org/10.1186/s12911-021-01688-3 |
work_keys_str_mv | AT hansunwoo improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns AT williamsonbriand improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns AT fongyouyi improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns |