
Contextual protein and antibody encodings from equivariant graph transformers

Bibliographic Details
Main Authors: Mahajan, Sai Pooja; Ruffolo, Jeffrey A.; Gray, Jeffrey J.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory, 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10370091/
https://www.ncbi.nlm.nih.gov/pubmed/37503113
http://dx.doi.org/10.1101/2023.07.15.549154
author Mahajan, Sai Pooja
Ruffolo, Jeffrey A.
Gray, Jeffrey J.
collection PubMed
description The optimal residue identity at each position in a protein is determined by its structural, evolutionary, and functional context. We seek to learn the representation space of the optimal amino-acid residue in different structural contexts in proteins. Inspired by masked language modeling (MLM), our training aims to transduce learning of amino-acid labels from non-masked residues to masked residues in their structural environments and from general (e.g., a residue in a protein) to specific contexts (e.g., a residue at the interface of a protein or antibody complex). Our results on native sequence recovery and forward folding with AlphaFold2 suggest that the amino-acid label for a protein residue may be determined from its structural context alone (i.e., without knowledge of the sequence labels of surrounding residues). We further find that the sequence space sampled from our masked models recapitulates the evolutionary sequence neighborhood of the wildtype sequence. Remarkably, the sequences conditioned on highly plastic structures recapitulate the conformational flexibility encoded in the structures. Furthermore, maximum-likelihood interfaces designed with masked models recapitulate wildtype binding energies for a wide range of protein interfaces and binding strengths. We also propose and compare fine-tuning strategies to train models for designing CDR loops of antibodies in the structural context of the antibody-antigen interface by leveraging structural databases for proteins, antibodies (synthetic and experimental), and protein-protein complexes. We show that pretraining on more general contexts improves native sequence recovery for antibody CDR loops, especially for the hypervariable CDR H3, while fine-tuning helps to preserve patterns observed in special contexts.
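To make the MLM-style training described above concrete, here is a minimal sketch in Python/PyTorch of masked amino-acid label prediction conditioned on structure. It is an illustrative assumption, not the authors' implementation: StructureGraphEncoder is a hypothetical stand-in for the paper's equivariant graph transformer, and the shapes, masking pattern, and hyperparameters are invented for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AA = 20        # canonical amino-acid labels
MASK_IDX = NUM_AA  # extra token id marking a masked residue

class StructureGraphEncoder(nn.Module):
    # Hypothetical stand-in for the paper's equivariant graph transformer:
    # here just an MLP over per-residue coordinates and features.
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim + 3, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, coords, feats):
        return self.mlp(torch.cat([coords, feats], dim=-1))

class MaskedResidueDesigner(nn.Module):
    # Predicts amino-acid labels at masked positions from structural context.
    def __init__(self, encoder: nn.Module, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(NUM_AA + 1, hidden_dim)  # +1 for the mask token
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, NUM_AA)           # per-residue label logits

    def forward(self, coords, labels, mask):
        # Hide sequence labels at masked positions; the encoder must recover
        # them from the backbone coordinates and the unmasked neighbors.
        tokens = labels.masked_fill(mask, MASK_IDX)
        h = self.encoder(coords, self.embed(tokens))         # (num_residues, hidden_dim)
        return self.head(h)                                  # (num_residues, NUM_AA)

def masked_label_loss(model, coords, labels, mask):
    # Cross-entropy only over masked residues, as in masked language modeling.
    logits = model(coords, labels, mask)
    return F.cross_entropy(logits[mask], labels[mask])

# Toy usage: a 50-residue chain with every seventh residue masked.
coords = torch.randn(50, 3)                  # one coordinate per residue (assumed)
labels = torch.randint(0, NUM_AA, (50,))
mask = torch.zeros(50, dtype=torch.bool)
mask[::7] = True
model = MaskedResidueDesigner(StructureGraphEncoder())
loss = masked_label_loss(model, coords, labels, mask)
loss.backward()

Per the abstract, the paper's encoder is an equivariant graph transformer, and the same masked-prediction setup is pretrained on general protein contexts and then fine-tuned on antibody-antigen interface contexts for CDR design.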
format Online
Article
Text
id pubmed-10370091
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10370091 2023-07-27 Contextual protein and antibody encodings from equivariant graph transformers Mahajan, Sai Pooja; Ruffolo, Jeffrey A.; Gray, Jeffrey J. bioRxiv Article Cold Spring Harbor Laboratory 2023-07-29 /pmc/articles/PMC10370091/ /pubmed/37503113 http://dx.doi.org/10.1101/2023.07.15.549154 Text en This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use.
title Contextual protein and antibody encodings from equivariant graph transformers
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10370091/
https://www.ncbi.nlm.nih.gov/pubmed/37503113
http://dx.doi.org/10.1101/2023.07.15.549154