Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets

Bibliographic Details
Main authors: Kanakala, Ganesh Chandan; Aggarwal, Rishal; Nayar, Divya; Priyakumar, U. Deva
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9850481/
https://www.ncbi.nlm.nih.gov/pubmed/36687059
http://dx.doi.org/10.1021/acsomega.2c06781
Description
Summary: Drug design involves the process of identifying and designing molecules that bind well to a given receptor. A vital computational component of this process is the protein–ligand interaction scoring functions that evaluate the binding ability of various molecules or ligands with a given protein receptor binding pocket reasonably accurately. With the publicly available protein–ligand binding affinity data sets in both sequential and structural forms, machine learning methods have gained traction as a top choice for developing such scoring functions. While the performance shown by these models is optimistic, there are several hidden biases present in these data sets themselves that affect the utility of such models for practical purposes such as virtual screening. In this work, we use published methods to systematically investigate several such factors or biases present in these data sets. In our analysis, we highlight the importance of considering sequence, protein–ligand interaction, and pocket structure similarity while constructing data splits and provide an explanation for good protein-only and ligand-only performances in some data sets. Through this study, we provide to the community several pointers for the design of binding affinity predictors and data sets for reliable applicability.