Structure-Informed Protein Language Models are Robust Predictors for Variant Effects
Main Authors: | Sun, Yuanfei; Shen, Yang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Journal Experts, 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10418537/ https://www.ncbi.nlm.nih.gov/pubmed/37577664 http://dx.doi.org/10.21203/rs.3.rs-3219092/v1 |
_version_ | 1785088287827820544 |
---|---|
author | Sun, Yuanfei; Shen, Yang
author_facet | Sun, Yuanfei; Shen, Yang
author_sort | Sun, Yuanfei |
collection | PubMed |
description | Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) have emerged as zero-shot predictors that need no effect labels, by modeling the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are implicitly modeled and effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that improvements in these often correlate with better variant effect prediction, but their tradeoff can present a barrier, as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) to inject protein structural contexts purposely and controllably, by extending the masked sequence denoising of conventional pLMs to cross-modality denoising. Our SI-pLMs can revise any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and use structures only as a context provider and model regularizer during training. Numerical results over deep mutational scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods, including other pLMs, regardless of the target protein family's evolutionary information content or the tendency toward overfitting/over-finetuning. Learned distributions in structural contexts could enhance sequence distributions in predicting variant effects. Ablation studies reveal the major contributing factors, and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git. (A toy sketch of the cross-modality denoising idea appears after the record below.) |
format | Online Article Text |
id | pubmed-10418537 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | American Journal Experts |
record_format | MEDLINE/PubMed |
spelling | pubmed-10418537 2023-08-12 Structure-Informed Protein Language Models are Robust Predictors for Variant Effects Sun, Yuanfei Shen, Yang Res Sq Article Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) have emerged as zero-shot predictors that need no effect labels, by modeling the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are implicitly modeled and effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that improvements in these often correlate with better variant effect prediction, but their tradeoff can present a barrier, as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) to inject protein structural contexts purposely and controllably, by extending the masked sequence denoising of conventional pLMs to cross-modality denoising. Our SI-pLMs can revise any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and use structures only as a context provider and model regularizer during training. Numerical results over deep mutational scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods, including other pLMs, regardless of the target protein family's evolutionary information content or the tendency toward overfitting/over-finetuning. Learned distributions in structural contexts could enhance sequence distributions in predicting variant effects. Ablation studies reveal the major contributing factors, and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git. American Journal Experts 2023-08-03 /pmc/articles/PMC10418537/ /pubmed/37577664 http://dx.doi.org/10.21203/rs.3.rs-3219092/v1 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
spellingShingle | Article Sun, Yuanfei Shen, Yang Structure-Informed Protein Language Models are Robust Predictors for Variant Effects |
title | Structure-Informed Protein Language Models are Robust Predictors for Variant Effects |
title_full | Structure-Informed Protein Language Models are Robust Predictors for Variant Effects |
title_fullStr | Structure-Informed Protein Language Models are Robust Predictors for Variant Effects |
title_full_unstemmed | Structure-Informed Protein Language Models are Robust Predictors for Variant Effects |
title_short | Structure-Informed Protein Language Models are Robust Predictors for Variant Effects |
title_sort | structure-informed protein language models are robust predictors for variant effects |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10418537/ https://www.ncbi.nlm.nih.gov/pubmed/37577664 http://dx.doi.org/10.21203/rs.3.rs-3219092/v1 |
work_keys_str_mv | AT sunyuanfei structureinformedproteinlanguagemodelsarerobustpredictorsforvarianteffects AT shenyang structureinformedproteinlanguagemodelsarerobustpredictorsforvarianteffects |
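The abstract describes extending masked sequence denoising to cross-modality denoising, with structures used only as a training-time context provider and regularizer, so that variant-effect scoring needs sequence alone. The following is a minimal, hypothetical PyTorch sketch of that general idea; it is not the authors' implementation (see their repository above), and all module names, dimensions, the contact-map regularizer, and the loss weight are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): a masked-denoising protein LM
# whose training loss adds a structure-conditioned term, while variant-effect
# scoring at inference uses the sequence alone, mirroring "structures as
# context provider and model regularizer during training".
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 25      # 20 amino acids + special tokens (assumed)
MASK_ID = 24    # index of the [MASK] token (assumed)
D = 256         # embedding width (assumed)

class SequenceDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(D, VOCAB)
        # Used ONLY during training: predicts residue-residue contacts from
        # sequence embeddings, injecting structural context as a regularizer.
        self.contact_head = nn.Bilinear(D, D, 1)

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))   # (B, L, D)
        return self.lm_head(h), h

def cross_modality_loss(model, tokens, corrupted, mask, contacts, alpha=0.1):
    """Masked-token recovery plus a structure (contact-map) penalty.

    tokens/corrupted: (B, L) long tensors; mask: (B, L) bool of masked sites;
    contacts: (B, L, L) float {0, 1} map derived from a training structure.
    """
    logits, h = model(corrupted)
    mlm = F.cross_entropy(logits[mask], tokens[mask])   # denoise masked sites
    B, L, _ = h.shape
    hi = h.unsqueeze(2).expand(B, L, L, D).reshape(-1, D)
    hj = h.unsqueeze(1).expand(B, L, L, D).reshape(-1, D)
    pair = model.contact_head(hi, hj).view(B, L, L)     # pairwise logits
    struct = F.binary_cross_entropy_with_logits(pair, contacts)
    return mlm + alpha * struct                         # alpha: assumed weight

@torch.no_grad()
def score_variant(model, tokens, pos, wt, mut):
    """Zero-shot effect score: log p(mutant) - log p(wild type) at a masked
    position; no structure input is needed at this stage."""
    x = tokens.clone()
    x[0, pos] = MASK_ID
    logits, _ = model(x)
    logp = logits[0, pos].log_softmax(-1)
    return (logp[mut] - logp[wt]).item()
```

Under these assumptions, `cross_modality_loss` is called only during training, where a contact map from a known structure regularizes the sequence embeddings; `score_variant` is all that runs at prediction time, consistent with the paper's claim that structures are not required as model inputs for variant effect prediction.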