
Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty

Measurements of protein–ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance, and they should ideally be factored into modelling and output predictions, such as the actua...


Bibliographic Details
Main authors:	Mervin, Lewis H.; Trapotsi, Maria-Anna; Afzal, Avid M.; Barrett, Ian P.; Bender, Andreas; Engkvist, Ola
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2021
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8375213/ https://www.ncbi.nlm.nih.gov/pubmed/34412708 http://dx.doi.org/10.1186/s13321-021-00539-7

_version_	1783740275588333568
author	Mervin, Lewis H.; Trapotsi, Maria-Anna; Afzal, Avid M.; Barrett, Ian P.; Bender, Andreas; Engkvist, Ola
author_sort	Mervin, Lewis H.
collection	PubMed
description	Measurements of protein–ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance, and they should ideally be factored into modelling and output predictions, for example via the actual standard deviation of experimental measurements (σ) or the comparability of activity values between aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. To improve upon the current state of the art, we herein present a novel approach toward predicting protein–ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied to in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated under various scenarios of experimental standard deviation in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information was not considered in any way by the original RF algorithm. For example, in cases where σ ranged between 0.4 and 0.6 log units and the ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. Finally, PRF models trained with putative inactives performed worse than PRF models trained without them; this could be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-021-00539-7.
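
The description refers to "ideal probability estimates" derived from a measured activity value and its experimental standard deviation σ, but the record does not give the formula. The following is a minimal sketch of one plausible reading, assuming the measurement error is normally distributed, so that the probability of a compound being truly active is the normal CDF of the distance to the threshold in units of σ. The function name, the threshold of 6.0 and the σ of 0.5 log units are illustrative assumptions, not values taken from the record.

```python
import math


def ideal_activity_probability(pxc50: float, threshold: float, sigma: float) -> float:
    """Probability that the true activity is at or above `threshold`.

    Assumes (hypothetically) that the measured pXC50 carries normally
    distributed error with standard deviation `sigma`, in log units.
    """
    if sigma <= 0:
        # No uncertainty: fall back to the usual hard binary label.
        return 1.0 if pxc50 >= threshold else 0.0
    z = (pxc50 - threshold) / sigma
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Illustrative values only: threshold and sigma are assumptions.
    for value in (5.0, 5.9, 6.0, 6.1, 7.0):
        p = ideal_activity_probability(value, threshold=6.0, sigma=0.5)
        print(f"pXC50={value:.1f} -> P(active)={p:.3f}")
```

Under these assumptions, a compound measured exactly at the threshold gets a probability of 0.5 rather than a hard label, and with σ = 0.5 only measurements within roughly 0.13 log units of the threshold fall in the 0.4 to 0.6 probability band mentioned in the description.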
format	Online Article Text
id	pubmed-8375213
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-8375213 2021-08-23 J Cheminform Research Article Springer International Publishing 2021-08-19 /pmc/articles/PMC8375213/ /pubmed/34412708 http://dx.doi.org/10.1186/s13321-021-00539-7 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8375213/ https://www.ncbi.nlm.nih.gov/pubmed/34412708 http://dx.doi.org/10.1186/s13321-021-00539-7


Similar Items