Machine Learning Models to Interrogate Proteome-wide Cysteine Ligandabilities

Bibliographic Details
Main Authors: Liu, Ruibin; Clayton, Joseph; Shen, Mingzhe; Shen, Jana
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10473668/
https://www.ncbi.nlm.nih.gov/pubmed/37662346
http://dx.doi.org/10.1101/2023.08.17.553742
Description
Summary: Machine learning (ML) identification of covalently ligandable sites may significantly accelerate the discovery of targeted covalent inhibitors and expand the druggable proteome space. Here we report the development of tree-based models and convolutional neural networks trained on a newly curated database (LigCys3D) of over 1,000 liganded cysteines in nearly 800 proteins, represented by over 10,000 X-ray structures reported in the Protein Data Bank (PDB). Tests on unseen data yielded an AUC (area under the receiver operating characteristic curve) of 94%, demonstrating the high predictive power of the models. Interestingly, application to proteins evaluated by activity-based protein profiling (ABPP) experiments in cell lines gave a lower AUC of 72%. Analysis revealed significant discrepancies between the structural environments of the ligandable cysteines captured by X-ray crystallography and those determined by ABPP. This surprising finding warrants further investigation and may have implications for future drug discovery. We discuss ways to improve the models and project future directions. Our work represents a first step towards the ML-led integration of big genome data, structure models, and chemoproteomic experiments to annotate the human proteome space for next-generation drug discovery.
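The AUC values quoted in the summary can be read as the probability that a randomly chosen positive example (here, a ligandable cysteine) receives a higher score than a randomly chosen negative one. The sketch below illustrates that definition only. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; they are not taken from the paper, and this is not the authors' evaluation code.

```python
def roc_auc(labels, scores):
    """Rank-based ROC AUC: fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = ligandable, 0 = not ligandable.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

print(f"AUC = {roc_auc(labels, scores):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the 94% and 72% figures in the summary sit.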