Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction

O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcript...

Bibliographic Details
Main Authors: Pokharel, Suresh; Pratyush, Pawel; Ismail, Hamid D.; Ma, Junfeng; KC, Dukka B.
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10650050/
https://www.ncbi.nlm.nih.gov/pubmed/37958983
http://dx.doi.org/10.3390/ijms242116000
author Pokharel, Suresh
Pratyush, Pawel
Ismail, Hamid D.
Ma, Junfeng
KC, Dukka B.
author_sort Pokharel, Suresh
collection PubMed
description O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite the great progress in experimentally mapping O-GlcNAc sites, there is an unmet need to develop robust prediction tools that can effectively locate the presence of O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for prediction of protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three protein sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites and also evaluated various ensemble strategies to integrate embeddings from these protein language models. Upon investigation, the decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on these individual language models as well as other fusion approaches and other existing predictors in almost all of the parameters evaluated. The precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and diseases. Moreover, these findings also indicate the effectiveness of combined uses of multiple protein language models in post-translational modification prediction and open exciting avenues for further research and exploration in other protein downstream tasks. LM-OGlcNAc-Site’s web server and source code are publicly available to the community.
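The decision-level fusion described above can be illustrated with a minimal sketch. The abstract does not state how the three per-model decisions are combined, so the hard majority vote below is an assumption, as are the function name `fuse_decisions`, the 0.5 threshold, and the example scores, which are invented for illustration and are not results from the paper.

```python
from typing import Dict, List, Sequence


def fuse_decisions(model_probs: Dict[str, Sequence[float]],
                   threshold: float = 0.5) -> List[int]:
    """Hard majority vote over per-model binary decisions.

    model_probs maps a model name to its predicted probability for each
    candidate S/T site (same site order for every model). This is one
    common form of decision-level fusion, assumed here for illustration.
    """
    lengths = {len(p) for p in model_probs.values()}
    if len(lengths) != 1:
        raise ValueError("every model must score the same candidate sites")

    n_models = len(model_probs)
    fused = []
    for site_scores in zip(*model_probs.values()):
        # Each model casts one vote per site: positive if its score
        # reaches the threshold.
        votes = sum(score >= threshold for score in site_scores)
        # A site is called positive only when a strict majority agrees.
        fused.append(1 if 2 * votes > n_models else 0)
    return fused


# Hypothetical scores for three candidate sites (not data from the paper).
scores = {
    "Ankh": [0.91, 0.40, 0.55],
    "ESM-2": [0.80, 0.62, 0.30],
    "ProtT5": [0.35, 0.70, 0.20],
}
fused = fuse_decisions(scores)
print(fused)  # [1, 1, 0]
```

With three voters a tie cannot occur, so every site receives a definite label. Averaging the probabilities before thresholding (soft voting) would be an equally plausible variant.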
format Online
Article
Text
id pubmed-10650050
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10650050 2023-11-06 Int J Mol Sci Article MDPI 2023-11-06 /pmc/articles/PMC10650050/ /pubmed/37958983 http://dx.doi.org/10.3390/ijms242116000 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction
topic Article