Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets

Predicting protein-ligand interactions using artificial intelligence (AI) models has attracted great interest in recent years. However, data-driven AI models unequivocally suffer from a lack of sufficiently large and unbiased datasets. Here, we systematically investigated the data biases on the PDBb...

Bibliographic Details
Main Authors:	Yang, Jincai; Shen, Cheng; Huang, Niu
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2020
Subjects:	Pharmacology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7052818/ https://www.ncbi.nlm.nih.gov/pubmed/32161539 http://dx.doi.org/10.3389/fphar.2020.00069

author	Yang, Jincai; Shen, Cheng; Huang, Niu
collection	PubMed
description	Predicting protein-ligand interactions using artificial intelligence (AI) models has attracted great interest in recent years. However, data-driven AI models unequivocally suffer from a lack of sufficiently large and unbiased datasets. Here, we systematically investigated the data biases on the PDBbind and DUD-E datasets. We examined the model performance of atomic convolutional neural network (ACNN) on the PDBbind core set and achieved a Pearson R² of 0.73 between experimental and predicted binding affinities. Strikingly, the ACNN models did not require learning the essential protein-ligand interactions in complex structures and achieved similar performance even on datasets containing only ligand structures or only protein structures, while data splitting based on similarity clustering (protein sequence or ligand scaffold) significantly reduced the model performance. We also identified the property and topology biases in the DUD-E dataset which led to the artificially increased enrichment performance of virtual screening. The property bias in DUD-E was reduced by enforcing the more stringent ligand property matching rules, while the topology bias still exists due to the use of molecular fingerprint similarity as a decoy selection criterion. Therefore, we believe that sufficiently large and unbiased datasets are desirable for training robust AI models to accurately predict protein-ligand interactions.
format	Online Article Text
id	pubmed-7052818
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-7052818 2020-03-11 Yang, Jincai; Shen, Cheng; Huang, Niu Front Pharmacol Pharmacology Frontiers Media S.A. 2020-02-25 /pmc/articles/PMC7052818/ /pubmed/32161539 http://dx.doi.org/10.3389/fphar.2020.00069 Text en Copyright © 2020 Yang, Shen and Huang http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets
topic	Pharmacology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7052818/ https://www.ncbi.nlm.nih.gov/pubmed/32161539 http://dx.doi.org/10.3389/fphar.2020.00069
