
Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for binding

Many recently proposed structure-based virtual screening models appear to be able to accurately distinguish high affinity binders from non-binders. However, several recent studies have shown that they often do so by exploiting ligand-specific biases in the dataset, rather than identifying favourable...


Bibliographic Details
Main Authors: Hadfield, Thomas E., Scantlebury, Jack, Deane, Charlotte M.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10509074/
https://www.ncbi.nlm.nih.gov/pubmed/37726844
http://dx.doi.org/10.1186/s13321-023-00755-3
author Hadfield, Thomas E.
Scantlebury, Jack
Deane, Charlotte M.
collection PubMed
description Many recently proposed structure-based virtual screening models appear to be able to accurately distinguish high affinity binders from non-binders. However, several recent studies have shown that they often do so by exploiting ligand-specific biases in the dataset, rather than identifying favourable intermolecular interactions in the input protein-ligand complex. In this work we propose a novel approach for assessing the extent to which machine learning-based virtual screening models are able to identify the functional groups responsible for binding. To sidestep the difficulty in establishing the ground truth importance of each atom of a large scale set of protein-ligand complexes, we propose a protocol for generating synthetic data. Each ligand in the dataset is surrounded by a randomly sampled point cloud of pharmacophores, and the label assigned to the synthetic protein-ligand complex is determined by a 3-dimensional deterministic binding rule. This allows us to precisely quantify the ground truth importance of each atom and compare it to the model generated attributions. Using our generated datasets, we demonstrate that a recently proposed deep learning-based virtual screening model, PointVS, identified the most important functional groups with 39% more efficiency than a fingerprint-based random forest, suggesting that it would generalise more effectively to new examples. In addition, we found that ligand-specific biases, such as those present in widely used virtual screening datasets, substantially impaired the ability of all ML models to identify the most important functional groups. We have made our synthetic data generation framework available to facilitate the benchmarking of new virtual screening models. Code is available at https://github.com/tomhadfield95/synthVS.
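The synthetic-data idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the synthVS repository linked above). The donor/acceptor rule, the 4.0 Å cutoff, the box padding and all function names below are assumptions chosen only to show how a deterministic 3D rule yields both a label and a per-atom ground truth that attributions can be scored against.

```python
import numpy as np


def sample_pharmacophores(ligand_xyz, n_points, rng, box_padding=4.0):
    """Sample a random pharmacophore point cloud in a box around the ligand.

    Hypothetical scheme: uniform positions, types drawn from donor/acceptor.
    """
    lo = ligand_xyz.min(axis=0) - box_padding
    hi = ligand_xyz.max(axis=0) + box_padding
    xyz = rng.uniform(lo, hi, size=(n_points, 3))
    types = rng.choice(["donor", "acceptor"], size=n_points)
    return xyz, types


def label_complex(ligand_xyz, ligand_types, pharm_xyz, pharm_types, cutoff=4.0):
    """Apply an assumed deterministic binding rule.

    A ligand atom counts as important if it lies within `cutoff` of a
    pharmacophore of the complementary type. The complex is labelled a
    binder (1) if at least one atom is important, otherwise 0.
    """
    # Pairwise ligand-atom to pharmacophore distances, shape (n_atoms, n_pharm).
    dists = np.linalg.norm(
        ligand_xyz[:, None, :] - pharm_xyz[None, :, :], axis=-1
    )
    lig = ligand_types[:, None]
    pha = pharm_types[None, :]
    complementary = ((lig == "donor") & (pha == "acceptor")) | (
        (lig == "acceptor") & (pha == "donor")
    )
    important_atoms = ((dists <= cutoff) & complementary).any(axis=1)
    return int(important_atoms.any()), important_atoms


def top_k_recovery(attributions, important_atoms):
    """Fraction of truly important atoms among the k highest-scored atoms,
    where k is the number of truly important atoms."""
    k = int(important_atoms.sum())
    if k == 0:
        return float("nan")
    top_k = np.argsort(attributions)[::-1][:k]
    return float(important_atoms[top_k].mean())


# Toy three-atom "ligand" with one nearby acceptor pharmacophore.
ligand_xyz = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
ligand_types = np.array(["donor", "hydrophobe", "acceptor"])
pharm_xyz = np.array([[0.0, 0.0, 3.0], [3.0, 0.0, 10.0]])
pharm_types = np.array(["acceptor", "donor"])

label, important = label_complex(ligand_xyz, ligand_types, pharm_xyz, pharm_types)
```

In this toy case only the first atom satisfies the rule, so a model whose attribution scores rank that atom highest gets a recovery of 1.0, while one that ranks another atom highest gets 0.0. Because the rule is known, this comparison needs no experimental annotation of which atoms matter.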
format Online
Article
Text
id pubmed-10509074
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-10509074 2023-09-21 Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for binding Hadfield, Thomas E. Scantlebury, Jack Deane, Charlotte M. J Cheminform Research Springer International Publishing 2023-09-19 /pmc/articles/PMC10509074/ /pubmed/37726844 http://dx.doi.org/10.1186/s13321-023-00755-3 Text en © The Author(s) 2023
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for binding
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10509074/
https://www.ncbi.nlm.nih.gov/pubmed/37726844
http://dx.doi.org/10.1186/s13321-023-00755-3