Improving random forest predictions in small datasets from two-phase sampling designs
BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases—a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.
Main Authors: Han, Sunwoo; Williamson, Brian D.; Fong, Youyi
Format: Online Article Text
Language: English
Published: BioMed Central, 2021
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8607560/ https://www.ncbi.nlm.nih.gov/pubmed/34809631 http://dx.doi.org/10.1186/s12911-021-01688-3
_version_ | 1784602584282038272 |
author | Han, Sunwoo Williamson, Brian D. Fong, Youyi |
author_facet | Han, Sunwoo Williamson, Brian D. Fong, Youyi |
author_sort | Han, Sunwoo |
collection | PubMed |
description | BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases—a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. RESULTS: Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of the weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers, and prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. CONCLUSION: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01688-3. |
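The description above names two of the paper's key ideas: weighting observations by the inverse of their sampling probability under a two-phase design, and stacking a random forest with a generalized linear model. The following is a minimal sketch of both using scikit-learn on simulated data (the trial dataset is not reproduced here; the sampling probabilities and all variable names are hypothetical, and this is an illustration of the general techniques, not the authors' exact pipeline).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for a small two-phase dataset: a rare outcome
# (~10% cases) and a moderate number of marker covariates.
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Inverse sampling probability weighting: assume cases (y == 1) were
# sampled with probability 1 and controls were subsampled with
# probability 0.25 (hypothetical values for illustration).
p_sampled = np.where(y == 1, 1.0, 0.25)
sample_weight = 1.0 / p_sampled

# Weighted random forest.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y, sample_weight=sample_weight)

# Stacking a random forest with a simple GLM (logistic regression);
# the meta-learner combines out-of-fold predictions from both.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
        ("glm", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
probs = stack.predict_proba(X)  # class probabilities from the ensemble
```

Per the abstract, the benefit of stacking depends on how dissimilar the candidate learners' predictions are, so in practice one would compare the correlation of the RF and GLM out-of-fold predictions before expecting a gain.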
format | Online Article Text |
id | pubmed-8607560 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8607560 2021-11-22 Improving random forest predictions in small datasets from two-phase sampling designs Han, Sunwoo Williamson, Brian D. Fong, Youyi BMC Med Inform Decis Mak Research Article BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases—a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. RESULTS: Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of the weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers, and prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. CONCLUSION: In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01688-3. BioMed Central 2021-11-22 /pmc/articles/PMC8607560/ /pubmed/34809631 http://dx.doi.org/10.1186/s12911-021-01688-3 Text en © The Author(s) 2021. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Research Article Han, Sunwoo Williamson, Brian D. Fong, Youyi Improving random forest predictions in small datasets from two-phase sampling designs |
title | Improving random forest predictions in small datasets from two-phase sampling designs |
title_full | Improving random forest predictions in small datasets from two-phase sampling designs |
title_fullStr | Improving random forest predictions in small datasets from two-phase sampling designs |
title_full_unstemmed | Improving random forest predictions in small datasets from two-phase sampling designs |
title_short | Improving random forest predictions in small datasets from two-phase sampling designs |
title_sort | improving random forest predictions in small datasets from two-phase sampling designs |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8607560/ https://www.ncbi.nlm.nih.gov/pubmed/34809631 http://dx.doi.org/10.1186/s12911-021-01688-3 |
work_keys_str_mv | AT hansunwoo improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns AT williamsonbriand improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns AT fongyouyi improvingrandomforestpredictionsinsmalldatasetsfromtwophasesamplingdesigns |