
Structure-Informed Protein Language Models are Robust Predictors for Variant Effects

Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) have emerged as zero-shot predictors that need no effect labels, by modeling the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are only implicitly modeled and effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that improvements in both often correlate with better variant effect prediction, but their tradeoff can present a barrier, as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) that injects protein structural contexts purposely and controllably, by extending the masked sequence denoising of conventional pLMs to cross-modality denoising. Our SI-pLMs can be applied to revise any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and use structures only as a context provider and model regularizer during training. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods, including other pLMs, regardless of the target protein family's evolutionary information content or the tendency to overfitting/over-finetuning. Learned distributions in structural contexts could enhance sequence distributions in predicting variant effects. Ablation studies reveal major contributing factors, and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git.
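
The abstract's core recipe, cross-modality denoising, can be made concrete. Below is a minimal, hypothetical PyTorch sketch of the idea: a small Transformer pLM trained with the usual masked-token denoising loss plus a structure-conditioned auxiliary loss (here, an assumed contact-map term) that acts only at training time, so inference needs sequence alone. The model, sizes, loss form, and all names (TinyPLM, training_step, etc.) are illustrative assumptions, not the authors' implementation; see the linked GitHub repository for the actual code.

```python
# Illustrative sketch only: masked-LM denoising plus a structure (contact-map)
# regularizer used at training time. Not the authors' architecture or losses.
import torch
import torch.nn as nn

VOCAB = 25     # assumed amino-acid vocabulary size, incl. special tokens
MASK_ID = 24   # assumed id of the [MASK] token
D_MODEL = 128  # assumed embedding width ("relatively compact" model)

class TinyPLM(nn.Module):
    """A small Transformer-encoder pLM with two heads:
    a sequence-denoising head and a pairwise structure head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)    # masked-token logits
        self.pair_head = nn.Linear(D_MODEL, D_MODEL)  # residue-pair scores

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))          # (B, L, D)
        z = self.pair_head(h)
        # Pairwise inner products as contact logits: (B, L, L).
        contact_logits = torch.einsum("bid,bjd->bij", z, z)
        return self.lm_head(h), contact_logits

def training_step(model, tokens, contacts, mask_frac=0.15, lam=0.1):
    """One step of cross-modality denoising: masked-LM loss plus a
    structure regularizer. Structures appear only here, during training;
    variant scoring at inference uses the sequence head alone."""
    mask = torch.rand(tokens.shape) < mask_frac       # random residue masking
    logits, contact_logits = model(tokens.masked_fill(mask, MASK_ID))
    mlm_loss = nn.functional.cross_entropy(
        logits[mask], tokens[mask])                   # denoise masked residues
    struct_loss = nn.functional.binary_cross_entropy_with_logits(
        contact_logits, contacts)                     # match observed contacts
    return mlm_loss + lam * struct_loss

# Toy usage with random stand-ins for family sequences and contact maps.
model = TinyPLM()
tokens = torch.randint(0, 20, (4, 50))                # 4 sequences, length 50
contacts = (torch.rand(4, 50, 50) < 0.05).float()     # assumed contact maps
loss = training_step(model, tokens, contacts)
loss.backward()
print(f"loss = {loss.item():.3f}")
```

At inference, a variant would be scored from the sequence head alone (e.g., comparing the log-probability of the mutant and wild-type residues at the mutated position), consistent with the abstract's claim that structures are not required as model inputs for prediction.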

Bibliographic Details
Main Authors: Sun, Yuanfei; Shen, Yang
Format: Online Article Text
Language: English
Published: American Journal Experts, 2023
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10418537/
https://www.ncbi.nlm.nih.gov/pubmed/37577664
http://dx.doi.org/10.21203/rs.3.rs-3219092/v1

Collection: PubMed
Record ID: pubmed-10418537
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Res Sq
Publication Date: 2023-08-03
License: Creative Commons Attribution 4.0 International (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.