Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction

Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, it is controversial as to whether new SFs are actually improving since the general, refined, and core datasets of PDBBind are cross-c...

Bibliographic Details
Main Authors: Li, Jie; Guan, Xingyi; Zhang, Oufan; Sun, Kunyang; Wang, Yingze; Bagni, Dorian; Head-Gordon, Teresa
Format: Online Article Text
Language: English
Published: Cornell University 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10462179/
https://www.ncbi.nlm.nih.gov/pubmed/37645037
author Li, Jie
Guan, Xingyi
Zhang, Oufan
Sun, Kunyang
Wang, Yingze
Bagni, Dorian
Head-Gordon, Teresa
collection PubMed
description Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, it is controversial whether new SFs are actually improving, since the general, refined, and core datasets of PDBBind are cross-contaminated with proteins and ligands of high similarity, and hence the SFs may not perform comparably well in binding prediction for new protein-ligand complexes. In this work, we have carefully prepared a cleaned PDBBind data set of non-covalent binders that is split into training, validation, and test datasets to control for data leakage. The resulting leak-proof (LP)-PDBBind data is used to retrain four popular SFs: AutoDock Vina, Random Forest (RF)-Score, InteractionGraphNet (IGN), and DeepDTA, to better test their capabilities when applied to new protein-ligand complexes. In particular, we have formulated a new independent data set, BDB2020+, by matching high-quality binding free energies from BindingDB with co-crystallized ligand-protein complexes from the PDB that have been deposited since 2020. Based on all the benchmark results, the retrained models using LP-PDBBind that rely on 3D information perform consistently among the best, with IGN especially being recommended for scoring and ranking applications for new protein-ligand systems.
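The leakage control described in the abstract can be pictured with a small sketch. This record does not state which similarity measures or cutoffs the authors used, so everything below is an assumption for illustration only: the complexes are toy entries, each is represented by a made-up set of string features, Jaccard similarity stands in for whatever protein and ligand similarity the paper applies, and the names `jaccard`, `leak_free_train`, and `max_similarity` are hypothetical.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between two feature sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leak_free_train(
    candidates: Dict[str, FrozenSet[str]],
    held_out: Dict[str, FrozenSet[str]],
    max_similarity: float,
) -> List[str]:
    """Keep candidates whose similarity to every held-out item is below the cutoff."""
    kept = []
    for name, feats in candidates.items():
        if all(jaccard(feats, h) < max_similarity for h in held_out.values()):
            kept.append(name)
    return kept


# Toy, made-up feature sets (not real PDBBind entries).
candidates = {
    "complex_a": frozenset({"f1", "f2", "f3"}),
    "complex_b": frozenset({"f7", "f8"}),
    "complex_c": frozenset({"f1", "f2", "f4"}),
}
held_out = {"test_x": frozenset({"f1", "f2", "f3", "f5"})}

# Hypothetical cutoff: drop anything at least 50% similar to a held-out item.
train_ids = leak_free_train(candidates, held_out, max_similarity=0.5)
print(train_ids)
```

Here `complex_a` is dropped because it shares three of four features with the held-out entry, while the other two fall below the cutoff and stay in the training pool. Filtering the training pool against the held-out set, rather than splitting at random, is what keeps near-duplicates from appearing on both sides of the split.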
format Online
Article
Text
id pubmed-10462179
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cornell University
record_format MEDLINE/PubMed
spelling pubmed-10462179 2023-08-29 ArXiv Article Cornell University 2023-08-18 /pmc/articles/PMC10462179/ /pubmed/37645037 Text en
This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10462179/
https://www.ncbi.nlm.nih.gov/pubmed/37645037