Learning the molecular grammar of protein condensates from sequence determinants and embeddings

Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed,...

Bibliographic Details
Main authors: Saar, Kadi L., Morgunov, Alexey S., Qi, Runzhang, Arter, William E., Krainer, Georg, Lee, Alpha A., Knowles, Tuomas P. J.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2021
Subjects: Biological Sciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8053968/
https://www.ncbi.nlm.nih.gov/pubmed/33827920
http://dx.doi.org/10.1073/pnas.2019053118
_version_ 1783680223520227328
author Saar, Kadi L.
Morgunov, Alexey S.
Qi, Runzhang
Arter, William E.
Krainer, Georg
Lee, Alpha A.
Knowles, Tuomas P. J.
author_sort Saar, Kadi L.
collection PubMed
description Intracellular phase separation of proteins into biomolecular condensates is increasingly recognized as a process with a key role in cellular compartmentalization and regulation. Different hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, with some of them probed experimentally through the use of constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding on a global level the associations between protein sequence and phase behavior and further constructed machine-learning models for predicting protein liquid–liquid phase separation (LLPS). Our analysis highlighted that LLPS-prone proteins are more disordered, less hydrophobic, and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database and that they show a fine balance in their relative content of polar and hydrophobic residues. To further learn in a hypothesis-free manner the sequence features underpinning LLPS, we trained a neural network-based language model and found that a classifier constructed on such embeddings learned the underlying principles of phase behavior at a comparable accuracy to a classifier that used knowledge-based features. By combining knowledge-based features with unsupervised embeddings, we generated an integrated model that distinguished LLPS-prone sequences both from structured proteins and from unstructured proteins with a lower LLPS propensity and further identified such sequences from the human proteome at a high accuracy. These results provide a platform rooted in molecular principles for understanding protein phase behavior. The predictor, termed DeePhase, is accessible from https://deephase.ch.cam.ac.uk/.
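The description above reports that LLPS-prone proteins have lower Shannon entropy than sequences in the Protein Data Bank or Swiss-Prot. As a minimal sketch of what such a measure can look like, the snippet below computes the Shannon entropy of a sequence's residue composition, in bits per residue. This is an assumption for illustration only: the record does not state the exact entropy definition used by the authors, and the function name and both example sequences are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition.

    Illustrative definition only; not necessarily the one used for DeePhase.
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence)
    total = len(sequence)
    # H = -sum(p_i * log2(p_i)) over the residue types present in the sequence
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sequences, not taken from the paper or any database.
low_complexity = "GGSGGSGGSGGSGGSGGSGG"  # dominated by two residue types
diverse = "ACDEFGHIKLMNPQRSTVWY"         # each of the 20 standard residues once

if __name__ == "__main__":
    print(f"low-complexity: {shannon_entropy(low_complexity):.3f} bits")
    print(f"diverse:        {shannon_entropy(diverse):.3f} bits")
```

Under this definition a sequence built from a few repeated residue types scores well below one that uses all 20 amino acids evenly, whose entropy is the maximum possible, log2(20) bits.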
format Online
Article
Text
id pubmed-8053968
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-8053968 2021-05-04 Proc Natl Acad Sci U S A National Academy of Sciences 2021-04-13 2021-04-07 Text en Copyright © 2021 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Learning the molecular grammar of protein condensates from sequence determinants and embeddings
topic Biological Sciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8053968/
https://www.ncbi.nlm.nih.gov/pubmed/33827920
http://dx.doi.org/10.1073/pnas.2019053118