
Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression

Uncertainty measures estimate the reliability of a predictive model. Especially in the field of molecular property prediction as part of drug design, model reliability is crucial. Besides other techniques, Random Forests have a long tradition in machine learning related to chemoinformatics and are w...

Bibliographic Details
Main authors:	Dutschmann, Thomas-Martin; Baumann, Knut
Format:	Online Article Text
Language:	English
Published:	MDPI 2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8588039/ https://www.ncbi.nlm.nih.gov/pubmed/34770921 http://dx.doi.org/10.3390/molecules26216514

_version_	1784598336579305472
author	Dutschmann, Thomas-Martin Baumann, Knut
author_facet	Dutschmann, Thomas-Martin Baumann, Knut
author_sort	Dutschmann, Thomas-Martin
collection	PubMed
description	Uncertainty measures estimate the reliability of a predictive model. Model reliability is crucial in molecular property prediction for drug design. Among other techniques, Random Forests have a long tradition in chemoinformatics-related machine learning and are widely used. A Random Forest is an ensemble of individual regression models, namely decision trees, and therefore provides an uncertainty measure by construction. The disagreement among the single-model predictions is used for this purpose: a narrower distribution of predictions is interpreted as higher reliability. The standard deviation of the decision tree ensemble predictions is the default uncertainty measure for Random Forests. Due to the increasing application of machine learning in drug design, there is a constant search for novel uncertainty measures that, ideally, outperform classical uncertainty criteria. When analyzing Random Forests, it seems natural to consider the variance of the dependent variable within each terminal decision tree leaf to obtain predictive uncertainties. Under this approach, predictions that arise from more leaves of high variance are considered less reliable. As expected, the number of such high-variance leaves yields a reasonable uncertainty measure. Depending on the dataset, it can also outperform ensemble uncertainties. However, small-scale comparisons, i.e., those considering only a few datasets, are insufficient, since they are more prone to chance correlations. Therefore, large-scale evaluations are required to make general claims about the performance of uncertainty measures. On several chemoinformatic regression datasets, high-variance leaves are compared to the standard deviation of ensemble predictions. It turns out that high-variance leaf uncertainty is meaningful but not superior to the default ensemble standard deviation. A brief possible explanation is offered.
format	Online Article Text
id	pubmed-8588039
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-8588039 2021-11-13 Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression Dutschmann, Thomas-Martin Baumann, Knut Molecules Article MDPI 2021-10-28 /pmc/articles/PMC8588039/ /pubmed/34770921 http://dx.doi.org/10.3390/molecules26216514 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Dutschmann, Thomas-Martin Baumann, Knut Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
title	Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
title_full	Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
title_fullStr	Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
title_full_unstemmed	Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
title_short	Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression
title_sort	evaluating high-variance leaves as uncertainty measure for random forest regression
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8588039/ https://www.ncbi.nlm.nih.gov/pubmed/34770921 http://dx.doi.org/10.3390/molecules26216514
work_keys_str_mv	AT dutschmannthomasmartin evaluatinghighvarianceleavesasuncertaintymeasureforrandomforestregression AT baumannknut evaluatinghighvarianceleavesasuncertaintymeasureforrandomforestregression
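
The description field in this record contrasts two uncertainty measures for a Random Forest regression prediction: the standard deviation of the individual tree predictions, and the number of terminal leaves with high target variance that a sample falls into. The following is a minimal sketch of both quantities for a single query sample, not the authors' implementation. It assumes the forest has already been fitted and that, for each tree, the training targets stored in the leaf reached by the sample are available as a plain list. The function names, the toy data, and the variance threshold of 0.5 are hypothetical; the record does not state how "high variance" is defined.

```python
from statistics import mean, pstdev, pvariance


def ensemble_std(tree_predictions):
    """Standard deviation of the individual tree predictions for one sample."""
    return pstdev(tree_predictions)


def leaf_variance(leaf_targets):
    """Variance of the training targets that ended up in one terminal leaf."""
    return pvariance(leaf_targets)


def high_variance_leaf_count(leaves, threshold):
    """Count the leaves (one per tree) whose target variance exceeds threshold.

    The threshold is an assumption made for this sketch.
    """
    return sum(1 for targets in leaves if leaf_variance(targets) > threshold)


# Hypothetical query sample routed through a 4-tree forest: each inner list
# holds the training targets of the leaf that the sample reached in one tree.
leaves = [
    [1.0, 1.2, 0.8],
    [0.9, 1.1],
    [0.2, 2.4, 1.0],
    [1.0, 1.0, 1.0],
]

# A regression tree predicts the mean of the targets in the reached leaf,
# and the forest prediction is the mean over all trees.
tree_predictions = [mean(targets) for targets in leaves]
forest_prediction = mean(tree_predictions)

sd = ensemble_std(tree_predictions)                       # default measure
n_high = high_variance_leaf_count(leaves, threshold=0.5)  # leaf-based measure

print(f"prediction={forest_prediction:.3f} sd={sd:.4f} high-variance leaves={n_high}")
```

Under both measures a larger value marks the prediction as less reliable. In this toy case only the third leaf has a wide spread of targets, so it is the only one counted as high-variance.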
