
Protein language models can capture protein quaternary state

BACKGROUND: Determining a protein’s quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determina...


Bibliographic Details
Main Authors:	Avraham, Orly; Tsaban, Tomer; Ben-Aharon, Ziv; Tsaban, Linoy; Schueler-Furman, Ora
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2023
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647083/ https://www.ncbi.nlm.nih.gov/pubmed/37964216 http://dx.doi.org/10.1186/s12859-023-05549-w

author	Avraham, Orly; Tsaban, Tomer; Ben-Aharon, Ziv; Tsaban, Linoy; Schueler-Furman, Ora
author_sort	Avraham, Orly
collection	PubMed
description	BACKGROUND: Determining a protein’s quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, which apply computational natural-language models to proteins, successfully capture secondary structure, protein cell localization and other characteristics from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction.
RESULTS: We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary-state-related information is included in such embeddings.
CONCLUSIONS: QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems and for studies of large sets of protein sequences. A simple Colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-023-05549-w.
format	Online Article Text
id	pubmed-10647083
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-10647083 2023-11-14 Protein language models can capture protein quaternary state Avraham, Orly; Tsaban, Tomer; Ben-Aharon, Ziv; Tsaban, Linoy; Schueler-Furman, Ora BMC Bioinformatics Research BioMed Central 2023-11-14 /pmc/articles/PMC10647083/ /pubmed/37964216 http://dx.doi.org/10.1186/s12859-023-05549-w Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Protein language models can capture protein quaternary state
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647083/ https://www.ncbi.nlm.nih.gov/pubmed/37964216 http://dx.doi.org/10.1186/s12859-023-05549-w
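The RESULTS part of the abstract mentions assessing the model on a non-overlapping set at the ECOD family level. As a minimal sketch of that general idea (not the authors' code), the Python snippet below splits records so that no family label appears in both the training and the test set. The function name `split_by_family`, the toy protein identifiers, and the family labels are all hypothetical, and the greedy fill strategy is an assumption chosen for brevity.

```python
import random
from collections import defaultdict


def split_by_family(records, test_fraction=0.2, seed=0):
    """Split (protein_id, family_id) pairs so no family spans both sets."""
    by_family = defaultdict(list)
    for protein_id, family_id in records:
        by_family[family_id].append(protein_id)

    # Shuffle whole families, not individual proteins.
    families = sorted(by_family)
    random.Random(seed).shuffle(families)

    target = test_fraction * len(records)
    train, test = [], []
    for family in families:
        # Assign whole families to the test set until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(by_family[family])
    return train, test


# Hypothetical toy records: (protein identifier, family label).
records = [("p1", "famA"), ("p2", "famA"), ("p3", "famB"),
           ("p4", "famC"), ("p5", "famC"), ("p6", "famD")]
train_ids, test_ids = split_by_family(records, test_fraction=0.3)
```

Because whole families are assigned at once, the test set size only approximates `test_fraction`; the point of the sketch is that related sequences never end up on both sides of the split.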