cdsBERT - Extending Protein Language Models with Codon Awareness

Recent advancements in Protein Language Models (pLMs) have enabled high-throughput analysis of proteins through primary sequence alone. At the same time, newfound evidence illustrates that codon usage bias is remarkably predictive and can even change the final structure of a protein. Here, we explore these findings by extending the traditional vocabulary of pLMs from amino acids to codons to encapsulate more information inside CoDing Sequences (CDS). We build upon traditional transfer learning techniques with a novel pipeline of token embedding matrix seeding, masked language modeling, and student-teacher knowledge distillation, called MELD. This transformed the pretrained ProtBERT into cdsBERT, a pLM with a codon vocabulary trained on a massive corpus of CDS. Interestingly, cdsBERT variants produced a highly biochemically relevant latent space, outperforming their amino acid-based counterparts on enzyme commission number prediction. Further analysis revealed that synonymous codon token embeddings moved distinctly in the embedding space, showcasing unique additions of information across broad phylogeny inside these traditionally “silent” mutations. This embedding movement correlated significantly with average usage bias across phylogeny. Future fine-tuned, organism-specific codon pLMs may further increase codon usage fidelity. This work points to the exciting potential of using the codon vocabulary to improve current state-of-the-art structure and function prediction, which will necessitate the creation of a codon pLM foundation model alongside the addition of high-quality CDS to large-scale protein databases.
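The MELD pipeline described in the abstract begins with token embedding matrix seeding from a pretrained amino-acid pLM. As a minimal sketch of that idea only (the function name, arguments, and toy data below are illustrative assumptions, not the authors' released code), each codon token can be initialized from the embedding of the amino acid it encodes, so synonymous codons start at the same point and are free to drift apart during the subsequent masked language modeling and knowledge distillation; that drift is the embedding movement the abstract analyzes.

```python
# Minimal sketch (assumed names, not the authors' code) of seeding a
# codon-vocabulary embedding matrix from pretrained amino-acid embeddings.
import torch

def seed_codon_embeddings(aa_embeddings, aa_to_index, codon_to_aa, noise_std=0.0):
    """Build a codon embedding matrix seeded from amino-acid embeddings.

    aa_embeddings: (n_amino_acids, hidden_dim) tensor from the pretrained pLM.
    aa_to_index:   amino-acid letter -> row in aa_embeddings.
    codon_to_aa:   genetic-code mapping, codon -> amino-acid letter.
    noise_std:     optional jitter so synonymous codons are not perfectly identical.
    """
    codons = sorted(codon_to_aa)
    hidden_dim = aa_embeddings.size(1)
    seeded = torch.empty(len(codons), hidden_dim)
    codon_to_index = {}
    for i, codon in enumerate(codons):
        # Copy the embedding of the encoded amino acid, plus optional noise.
        aa_vector = aa_embeddings[aa_to_index[codon_to_aa[codon]]]
        seeded[i] = aa_vector + noise_std * torch.randn(hidden_dim)
        codon_to_index[codon] = i
    return seeded, codon_to_index

# Toy usage with a two-residue vocabulary; a real run would use ProtBERT's
# embedding matrix and the full 64-codon genetic code.
aa_emb = torch.randn(2, 8)                                # pretend amino-acid embeddings
aa_idx = {"F": 0, "L": 1}
code = {"TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L"}   # four codons of the standard code
codon_emb, codon_idx = seed_codon_embeddings(aa_emb, aa_idx, code)
print(codon_emb.shape)   # torch.Size([4, 8])
```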

Bibliographic Details
Main Authors: Hallee, Logan; Rafailidis, Nikolaos; Gleghorn, Jason P.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory, 2023-09-17
Collection: PubMed (PMC10516008), National Center for Biotechnology Information
License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10516008/
https://www.ncbi.nlm.nih.gov/pubmed/37745387
http://dx.doi.org/10.1101/2023.09.15.558027