Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction
Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, it is controversial as to whether new SFs are actually improving since the general, refined, and core datasets of PDBBind are cross-c...
Main authors: | Li, Jie; Guan, Xingyi; Zhang, Oufan; Sun, Kunyang; Wang, Yingze; Bagni, Dorian; Head-Gordon, Teresa |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cornell University 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10462179/ https://www.ncbi.nlm.nih.gov/pubmed/37645037 |
_version_ | 1785098003447545856 |
---|---|
author | Li, Jie Guan, Xingyi Zhang, Oufan Sun, Kunyang Wang, Yingze Bagni, Dorian Head-Gordon, Teresa |
author_facet | Li, Jie Guan, Xingyi Zhang, Oufan Sun, Kunyang Wang, Yingze Bagni, Dorian Head-Gordon, Teresa |
author_sort | Li, Jie |
collection | PubMed |
description | Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, it is controversial as to whether new SFs are actually improving since the general, refined, and core datasets of PDBBind are cross-contaminated with proteins and ligands with high similarity, and hence they may not perform comparably well in binding prediction of new protein-ligand complexes. In this work we have carefully prepared a cleaned PDBBind data set of non-covalent binders that are split into training, validation, and test datasets to control for data leakage. The resulting leak-proof (LP)-PDBBind data is used to retrain four popular SFs: AutoDock Vina, Random Forest (RF)-Score, InteractionGraphNet (IGN), and DeepDTA, to better test their capabilities when applied to new protein-ligand complexes. In particular we have formulated a new independent data set, BDB2020+, by matching high quality binding free energies from BindingDB with co-crystallized ligand-protein complexes from the PDB that have been deposited since 2020. Based on all the benchmark results, the retrained models using LP-PDBBind that rely on 3D information perform consistently among the best, with IGN especially being recommended for scoring and ranking applications for new protein-ligand systems. |
format | Online Article Text |
id | pubmed-10462179 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cornell University |
record_format | MEDLINE/PubMed |
spelling | pubmed-10462179 2023-08-29 Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction Li, Jie Guan, Xingyi Zhang, Oufan Sun, Kunyang Wang, Yingze Bagni, Dorian Head-Gordon, Teresa ArXiv Article Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, it is controversial as to whether new SFs are actually improving since the general, refined, and core datasets of PDBBind are cross-contaminated with proteins and ligands with high similarity, and hence they may not perform comparably well in binding prediction of new protein-ligand complexes. In this work we have carefully prepared a cleaned PDBBind data set of non-covalent binders that are split into training, validation, and test datasets to control for data leakage. The resulting leak-proof (LP)-PDBBind data is used to retrain four popular SFs: AutoDock Vina, Random Forest (RF)-Score, InteractionGraphNet (IGN), and DeepDTA, to better test their capabilities when applied to new protein-ligand complexes. In particular we have formulated a new independent data set, BDB2020+, by matching high quality binding free energies from BindingDB with co-crystallized ligand-protein complexes from the PDB that have been deposited since 2020. Based on all the benchmark results, the retrained models using LP-PDBBind that rely on 3D information perform consistently among the best, with IGN especially being recommended for scoring and ranking applications for new protein-ligand systems.
Cornell University 2023-08-18 /pmc/articles/PMC10462179/ /pubmed/37645037 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
spellingShingle | Article Li, Jie Guan, Xingyi Zhang, Oufan Sun, Kunyang Wang, Yingze Bagni, Dorian Head-Gordon, Teresa Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction |
title | Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction |
title_full | Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction |
title_fullStr | Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction |
title_full_unstemmed | Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction |
title_short | Leak Proof PDBBind: A Reorganized Dataset of Protein-Ligand Complexes for More Generalizable Binding Affinity Prediction |
title_sort | leak proof pdbbind: a reorganized dataset of protein-ligand complexes for more generalizable binding affinity prediction |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10462179/ https://www.ncbi.nlm.nih.gov/pubmed/37645037 |