
Structure-Informed Protein Language Models are Robust Predictors for Variant Effects

Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) have emerged as zero-shot predictors that need no effect labels, by modeling the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are only implicitly modeled and effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that improvements in both often correlate with better variant effect prediction, but their tradeoff can present a barrier, as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) that injects protein structural contexts purposely and controllably, by extending the masked sequence denoising of conventional pLMs to cross-modality denoising. Our SI-pLMs can be applied to revise any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and use structures only as a context provider and model regularizer during training. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods, including other pLMs, regardless of the target protein family's evolutionary information content or the tendency to overfitting/over-finetuning. Learned distributions in structural contexts could enhance sequence distributions in predicting variant effects. Ablation studies reveal major contributing factors, and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git.
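
The abstract's core recipe, cross-modality denoising, can be made concrete. Below is a minimal, hypothetical PyTorch sketch of the idea: a small Transformer pLM trained with the usual masked-token denoising loss plus a structure-conditioned auxiliary loss (here, an assumed contact-map term) that acts only at training time, so inference needs sequence alone. The model, sizes, loss form, and all names (TinyPLM, training_step, etc.) are illustrative assumptions, not the authors' implementation; see the linked GitHub repository for the actual code.

```python
# Illustrative sketch only: masked-LM denoising plus a structure (contact-map)
# regularizer used at training time. Not the authors' architecture or losses.
import torch
import torch.nn as nn

VOCAB = 25     # assumed amino-acid vocabulary size, incl. special tokens
MASK_ID = 24   # assumed id of the [MASK] token
D_MODEL = 128  # assumed embedding width ("relatively compact" model)

class TinyPLM(nn.Module):
    """A small Transformer-encoder pLM with two heads:
    a sequence-denoising head and a pairwise structure head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)    # masked-token logits
        self.pair_head = nn.Linear(D_MODEL, D_MODEL)  # residue-pair scores

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))          # (B, L, D)
        z = self.pair_head(h)
        # Pairwise inner products as contact logits: (B, L, L).
        contact_logits = torch.einsum("bid,bjd->bij", z, z)
        return self.lm_head(h), contact_logits

def training_step(model, tokens, contacts, mask_frac=0.15, lam=0.1):
    """One step of cross-modality denoising: masked-LM loss plus a
    structure regularizer. Structures appear only here, during training;
    variant scoring at inference uses the sequence head alone."""
    mask = torch.rand(tokens.shape) < mask_frac       # random residue masking
    logits, contact_logits = model(tokens.masked_fill(mask, MASK_ID))
    mlm_loss = nn.functional.cross_entropy(
        logits[mask], tokens[mask])                   # denoise masked residues
    struct_loss = nn.functional.binary_cross_entropy_with_logits(
        contact_logits, contacts)                     # match observed contacts
    return mlm_loss + lam * struct_loss

# Toy usage with random stand-ins for family sequences and contact maps.
model = TinyPLM()
tokens = torch.randint(0, 20, (4, 50))                # 4 sequences, length 50
contacts = (torch.rand(4, 50, 50) < 0.05).float()     # assumed contact maps
loss = training_step(model, tokens, contacts)
loss.backward()
print(f"loss = {loss.item():.3f}")
```

At inference, a variant would be scored from the sequence head alone (e.g., comparing the log-probability of the mutant and wild-type residues at the mutated position), consistent with the abstract's claim that structures are not required as model inputs for prediction.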

Bibliographic Details
Main Authors: Sun, Yuanfei; Shen, Yang
Format: Online Article Text
Language: English
Published: American Journal Experts, 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10418537/
https://www.ncbi.nlm.nih.gov/pubmed/37577664
http://dx.doi.org/10.21203/rs.3.rs-3219092/v1

Collection: PubMed
Record ID: pubmed-10418537
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Res Sq
Publication Date: 2023-08-03
License: Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.