
Improving random forest predictions in small datasets from two-phase sampling designs


Bibliographic Details
Main Authors: Han, Sunwoo, Williamson, Brian D., Fong, Youyi
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8607560/
https://www.ncbi.nlm.nih.gov/pubmed/34809631
http://dx.doi.org/10.1186/s12911-021-01688-3
_version_ 1784602584282038272
author Han, Sunwoo
Williamson, Brian D.
Fong, Youyi
author_facet Han, Sunwoo
Williamson, Brian D.
Fong, Youyi
author_sort Han, Sunwoo
collection PubMed
description BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases—a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. RESULTS: Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests underperform generalized linear models for some subsets of markers, that prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. CONCLUSION: In small datasets from a two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01688-3.
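The two ideas the abstract highlights—inverse sampling probability weighting and combining a random forest with a generalized linear model—can be sketched roughly as follows. This is an illustrative sketch on synthetic data, not the authors' code: the sampling probabilities, marker data, and the fixed 50/50 blend of predicted probabilities are all assumptions (the paper learns the combination via cross-validated stacking, which this sketch does not reproduce).

```python
# Illustrative sketch (not the paper's method): inverse sampling probability
# weighting plus a simple blend of a random forest and a logistic regression.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a small two-phase sample: 100 subjects, 5 markers,
# a rare outcome, and known phase-two sampling probabilities in which cases
# are oversampled relative to controls (all values hypothetical).
n = 100
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.2).astype(int)
samp_prob = np.where(y == 1, 1.0, 0.25)  # cases sampled w.p. 1, controls w.p. 0.25
weights = 1.0 / samp_prob                # inverse sampling probability weights

# Fit both candidate learners with the sampling weights.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
glm = LogisticRegression(max_iter=1000)
rf.fit(X, y, sample_weight=weights)
glm.fit(X, y, sample_weight=weights)

# A fixed 50/50 blend of predicted case probabilities; true stacking would
# instead learn these combination weights from held-out predictions.
p_blend = 0.5 * rf.predict_proba(X)[:, 1] + 0.5 * glm.predict_proba(X)[:, 1]
```

The blend is only useful when the two learners' predictions differ—which echoes the abstract's observation that the gain from stacking depends on the dissimilarity between candidate learner predictions.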
format Online
Article
Text
id pubmed-8607560
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8607560 2021-11-22 Improving random forest predictions in small datasets from two-phase sampling designs Han, Sunwoo; Williamson, Brian D.; Fong, Youyi. BMC Med Inform Decis Mak, Research Article. BioMed Central 2021-11-22 /pmc/articles/PMC8607560/ /pubmed/34809631 http://dx.doi.org/10.1186/s12911-021-01688-3 Text en © The Author(s) 2021. Open Access: licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/); the Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research Article
Han, Sunwoo
Williamson, Brian D.
Fong, Youyi
Improving random forest predictions in small datasets from two-phase sampling designs
title Improving random forest predictions in small datasets from two-phase sampling designs
title_full Improving random forest predictions in small datasets from two-phase sampling designs
title_fullStr Improving random forest predictions in small datasets from two-phase sampling designs
title_full_unstemmed Improving random forest predictions in small datasets from two-phase sampling designs
title_short Improving random forest predictions in small datasets from two-phase sampling designs
title_sort improving random forest predictions in small datasets from two-phase sampling designs
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8607560/
https://www.ncbi.nlm.nih.gov/pubmed/34809631
http://dx.doi.org/10.1186/s12911-021-01688-3
work_keys_str_mv AT hansunwoo improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns
AT williamsonbriand improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns
AT fongyouyi improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns