Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets

Bibliographic Details
Main Authors: Kanakala, Ganesh Chandan; Aggarwal, Rishal; Nayar, Divya; Priyakumar, U. Deva
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9850481/
https://www.ncbi.nlm.nih.gov/pubmed/36687059
http://dx.doi.org/10.1021/acsomega.2c06781
author Kanakala, Ganesh Chandan
Aggarwal, Rishal
Nayar, Divya
Priyakumar, U. Deva
collection PubMed
description Drug design involves the process of identifying and designing molecules that bind well to a given receptor. A vital computational component of this process is the protein–ligand interaction scoring functions that evaluate the binding ability of various molecules or ligands with a given protein receptor binding pocket reasonably accurately. With the publicly available protein–ligand binding affinity data sets in both sequential and structural forms, machine learning methods have gained traction as a top choice for developing such scoring functions. While the performance shown by these models is optimistic, there are several hidden biases present in these data sets themselves that affect the utility of such models for practical purposes such as virtual screening. In this work, we use published methods to systematically investigate several such factors or biases present in these data sets. In our analysis, we highlight the importance of considering sequence, protein–ligand interaction, and pocket structure similarity while constructing data splits and provide an explanation for good protein-only and ligand-only performances in some data sets. Through this study, we provide to the community several pointers for the design of binding affinity predictors and data sets for reliable applicability.
format Online
Article
Text
id pubmed-9850481
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-9850481 2023-01-20 Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets Kanakala, Ganesh Chandan; Aggarwal, Rishal; Nayar, Divya; Priyakumar, U. Deva ACS Omega American Chemical Society 2023-01-05 /pmc/articles/PMC9850481/ /pubmed/36687059 http://dx.doi.org/10.1021/acsomega.2c06781 Text en © 2023 The Authors. Published by American Chemical Society.
https://creativecommons.org/licenses/by-nc-nd/4.0/ Permits non-commercial access and re-use, provided that author attribution and integrity are maintained; but does not permit creation of adaptations or other derivative works (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9850481/
https://www.ncbi.nlm.nih.gov/pubmed/36687059
http://dx.doi.org/10.1021/acsomega.2c06781