End-to-end sequence-structure-function meta-learning predicts genome-wide chemical-protein interactions for dark proteins

Systematically discovering protein-ligand interactions across the entire human and pathogen genomes is critical in chemical genomics, protein function prediction, drug discovery, and many other areas. However, more than 90% of gene families remain “dark”—i.e., their small-molecule ligands are undisc...

Bibliographic Details
Main authors: Cai, Tian; Xie, Li; Zhang, Shuo; Chen, Muge; He, Di; Badkul, Amitesh; Liu, Yang; Namballa, Hari Krishna; Dorogan, Michael; Harding, Wayne W.; Mura, Cameron; Bourne, Philip E.; Xie, Lei
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9886305/
https://www.ncbi.nlm.nih.gov/pubmed/36652496
http://dx.doi.org/10.1371/journal.pcbi.1010851
author Cai, Tian
Xie, Li
Zhang, Shuo
Chen, Muge
He, Di
Badkul, Amitesh
Liu, Yang
Namballa, Hari Krishna
Dorogan, Michael
Harding, Wayne W.
Mura, Cameron
Bourne, Philip E.
Xie, Lei
collection PubMed
description Systematically discovering protein-ligand interactions across the entire human and pathogen genomes is critical in chemical genomics, protein function prediction, drug discovery, and many other areas. However, more than 90% of gene families remain “dark”—i.e., their small-molecule ligands are undiscovered due to experimental limitations or human/historical biases. Existing computational approaches typically fail when the dark protein differs from those with known ligands. To address this challenge, we have developed a deep learning framework, called PortalCG, which consists of four novel components: (i) a 3-dimensional ligand binding site enhanced sequence pre-training strategy to encode the evolutionary links between ligand-binding sites across gene families; (ii) an end-to-end pretraining-fine-tuning strategy to reduce the impact of inaccuracy of predicted structures on function predictions by recognizing the sequence-structure-function paradigm; (iii) a new out-of-cluster meta-learning algorithm that extracts and accumulates information learned from predicting ligands of distinct gene families (meta-data) and applies the meta-data to a dark gene family; and (iv) a stress model selection step, using different gene families in the test data from those in the training and development data sets to facilitate model deployment in a real-world scenario. In extensive and rigorous benchmark experiments, PortalCG considerably outperformed state-of-the-art techniques of machine learning and protein-ligand docking when applied to dark gene families, and demonstrated its generalization power for target identifications and compound screenings under out-of-distribution (OOD) scenarios. Furthermore, in an external validation for the multi-target compound screening, the performance of PortalCG surpassed the rational design from medicinal chemists. 
Our results also suggest that a differentiable sequence-structure-function deep learning framework, where protein structural information serves as an intermediate layer, could be superior to conventional methodology where predicted protein structures were used for the compound screening. We applied PortalCG to two case studies to exemplify its potential in drug discovery: designing selective dual-antagonists of dopamine receptors for the treatment of opioid use disorder (OUD), and illuminating the understudied human genome for target diseases that do not yet have effective and safe therapeutics. Our results suggested that PortalCG is a viable solution to the OOD problem in exploring understudied regions of protein functional space.
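The abstract's component (iv) relies on keeping the gene families in the test data separate from those in the training and development data. The following is a minimal Python sketch of that idea only, not PortalCG's actual code: the function name `split_by_family`, the record layout `(family, protein, ligand, label)`, and the split fractions are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_family(records, dev_frac=0.1, test_frac=0.1, seed=0):
    """Split records so that no gene family appears in more than one split.

    Each record is a hypothetical (family, protein, ligand, label) tuple;
    only the first element (the family identifier) is used for splitting.
    """
    # Group records by their gene family.
    by_family = defaultdict(list)
    for rec in records:
        by_family[rec[0]].append(rec)

    # Shuffle family identifiers reproducibly, then carve out held-out families.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)
    n_test = max(1, round(len(families) * test_frac))
    n_dev = max(1, round(len(families) * dev_frac))
    test_families = set(families[:n_test])
    dev_families = set(families[n_test:n_test + n_dev])

    # Assign every record of a family to exactly one split.
    splits = {"train": [], "dev": [], "test": []}
    for family, recs in by_family.items():
        if family in test_families:
            splits["test"].extend(recs)
        elif family in dev_families:
            splits["dev"].extend(recs)
        else:
            splits["train"].extend(recs)
    return splits


if __name__ == "__main__":
    # Toy data: 10 made-up families with 3 protein-ligand pairs each.
    toy = [(f"fam{i}", f"prot{i}_{j}", "ligand", j % 2)
           for i in range(10) for j in range(3)]
    for name, recs in split_by_family(toy).items():
        print(name, len(recs), sorted({r[0] for r in recs}))
```

Splitting at the family level rather than the record level means that a model selected on the development set is scored on families it has never seen, which is the out-of-distribution setting the abstract describes.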
format Online
Article
Text
id pubmed-9886305
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9886305 2023-01-31 PLoS Comput Biol Research Article Public Library of Science 2023-01-18 /pmc/articles/PMC9886305/ /pubmed/36652496 http://dx.doi.org/10.1371/journal.pcbi.1010851 Text en © 2023 Cai et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title End-to-end sequence-structure-function meta-learning predicts genome-wide chemical-protein interactions for dark proteins
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9886305/
https://www.ncbi.nlm.nih.gov/pubmed/36652496
http://dx.doi.org/10.1371/journal.pcbi.1010851