cdsBERT - Extending Protein Language Models with Codon Awareness
Recent advancements in Protein Language Models (pLMs) have enabled high-throughput analysis of proteins through primary sequence alone. At the same time, newfound evidence illustrates that codon usage bias is remarkably predictive and can even change the final structure of a protein. Here, we explore these findings by extending the traditional vocabulary of pLMs from amino acids to codons to encapsulate more information inside CoDing Sequences (CDS).
Main Authors: | Hallee, Logan; Rafailidis, Nikolaos; Gleghorn, Jason P. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory, 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516008/ https://www.ncbi.nlm.nih.gov/pubmed/37745387 http://dx.doi.org/10.1101/2023.09.15.558027 |
_version_ | 1785109054611259392 |
---|---|
author | Hallee, Logan; Rafailidis, Nikolaos; Gleghorn, Jason P. |
author_facet | Hallee, Logan; Rafailidis, Nikolaos; Gleghorn, Jason P. |
author_sort | Hallee, Logan |
collection | PubMed |
description | Recent advancements in Protein Language Models (pLMs) have enabled high-throughput analysis of proteins through primary sequence alone. At the same time, newfound evidence illustrates that codon usage bias is remarkably predictive and can even change the final structure of a protein. Here, we explore these findings by extending the traditional vocabulary of pLMs from amino acids to codons to encapsulate more information inside CoDing Sequences (CDS). We build upon traditional transfer learning techniques with a novel pipeline of token embedding matrix seeding, masked language modeling, and student-teacher knowledge distillation, called MELD. This transformed the pretrained ProtBERT into cdsBERT; a pLM with a codon vocabulary trained on a massive corpus of CDS. Interestingly, cdsBERT variants produced a highly biochemically relevant latent space, outperforming their amino acid-based counterparts on enzyme commission number prediction. Further analysis revealed that synonymous codon token embeddings moved distinctly in the embedding space, showcasing unique additions of information across broad phylogeny inside these traditionally “silent” mutations. This embedding movement correlated significantly with average usage bias across phylogeny. Future fine-tuned organism-specific codon pLMs may potentially have a more significant increase in codon usage fidelity. This work enables an exciting potential in using the codon vocabulary to improve current state-of-the-art structure and function prediction that necessitates the creation of a codon pLM foundation model alongside the addition of high-quality CDS to large-scale protein databases. |
format | Online Article Text |
id | pubmed-10516008 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-10516008 2023-09-23 cdsBERT - Extending Protein Language Models with Codon Awareness. Hallee, Logan; Rafailidis, Nikolaos; Gleghorn, Jason P. bioRxiv Article. (Abstract as in the description field above.) Cold Spring Harbor Laboratory 2023-09-17 /pmc/articles/PMC10516008/ /pubmed/37745387 http://dx.doi.org/10.1101/2023.09.15.558027 Text en. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator. |
spellingShingle | Article; Hallee, Logan; Rafailidis, Nikolaos; Gleghorn, Jason P.; cdsBERT - Extending Protein Language Models with Codon Awareness |
title | cdsBERT - Extending Protein Language Models with Codon Awareness |
title_full | cdsBERT - Extending Protein Language Models with Codon Awareness |
title_fullStr | cdsBERT - Extending Protein Language Models with Codon Awareness |
title_full_unstemmed | cdsBERT - Extending Protein Language Models with Codon Awareness |
title_short | cdsBERT - Extending Protein Language Models with Codon Awareness |
title_sort | cdsbert - extending protein language models with codon awareness |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516008/ https://www.ncbi.nlm.nih.gov/pubmed/37745387 http://dx.doi.org/10.1101/2023.09.15.558027 |
work_keys_str_mv | AT halleelogan cdsbertextendingproteinlanguagemodelswithcodonawareness AT rafailidisnikolaos cdsbertextendingproteinlanguagemodelswithcodonawareness AT gleghornjasonp cdsbertextendingproteinlanguagemodelswithcodonawareness |
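The description field above outlines the MELD pipeline (token embedding matrix seeding, masked language modeling, and student-teacher knowledge distillation) used to turn the pretrained ProtBERT into cdsBERT. The sketch below illustrates only the first of those steps under stated assumptions; it is not the authors' released code. The Hugging Face `transformers` API, the `Rostlab/prot_bert` checkpoint name, the bare-trinucleotide token format, and the stop-codon handling are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of one MELD-style step: extend a
# pretrained amino-acid pLM's vocabulary with 64 codon tokens and seed each
# new embedding from the embedding of the amino acid that codon encodes.
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Standard genetic code; codons ordered TCAG x TCAG x TCAG, "*" marks stop codons.
AA_BY_CODON = dict(zip(
    (a + b + c for a in "TCAG" for b in "TCAG" for c in "TCAG"),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")

# 1) Add one token per codon (assumed token format: the bare trinucleotide).
tokenizer.add_tokens(list(AA_BY_CODON))
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

# 2) Seed every codon embedding with its amino acid's pretrained embedding,
#    so synonymous codons begin at the same point in the latent space.
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for codon, aa in AA_BY_CODON.items():
        if aa == "*":
            continue  # stop codons keep their random initialization in this sketch
        codon_id = tokenizer.convert_tokens_to_ids(codon)
        aa_id = tokenizer.convert_tokens_to_ids(aa)
        embeddings[codon_id] = embeddings[aa_id].clone()

# Steps not shown here: masked language modeling on a CDS corpus and
# student-teacher knowledge distillation against the original ProtBERT.
```

Seeding synonymous codons from a shared amino-acid embedding means any later divergence between them reflects information learned from the CDS corpus, which is the embedding movement the abstract reports correlating with average codon usage bias across phylogeny.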