Protein language models trained on multiple sequence alignments learn phylogenetic relationships
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold’s EvoFormer, take multiple sequence alignments (MSAs) of...
Main Authors: | Lupo, Umberto; Sgarbossa, Damiano; Bitbol, Anne-Florence |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2022 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9588007/ https://www.ncbi.nlm.nih.gov/pubmed/36273003 http://dx.doi.org/10.1038/s41467-022-34032-y |
_version_ | 1784814031303868416 |
---|---|
author | Lupo, Umberto; Sgarbossa, Damiano; Bitbol, Anne-Florence |
author_facet | Lupo, Umberto; Sgarbossa, Damiano; Bitbol, Anne-Florence |
author_sort | Lupo, Umberto |
collection | PubMed |
description | Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold’s EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer’s row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer’s column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models. |
format | Online Article Text |
id | pubmed-9588007 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-9588007 2022-10-24 Protein language models trained on multiple sequence alignments learn phylogenetic relationships Lupo, Umberto Sgarbossa, Damiano Bitbol, Anne-Florence Nat Commun Article Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold’s EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer’s row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer’s column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models. Nature Publishing Group UK 2022-10-22 /pmc/articles/PMC9588007/ /pubmed/36273003 http://dx.doi.org/10.1038/s41467-022-34032-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . |
spellingShingle | Article Lupo, Umberto Sgarbossa, Damiano Bitbol, Anne-Florence Protein language models trained on multiple sequence alignments learn phylogenetic relationships |
title | Protein language models trained on multiple sequence alignments learn phylogenetic relationships |
title_full | Protein language models trained on multiple sequence alignments learn phylogenetic relationships |
title_fullStr | Protein language models trained on multiple sequence alignments learn phylogenetic relationships |
title_full_unstemmed | Protein language models trained on multiple sequence alignments learn phylogenetic relationships |
title_short | Protein language models trained on multiple sequence alignments learn phylogenetic relationships |
title_sort | protein language models trained on multiple sequence alignments learn phylogenetic relationships |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9588007/ https://www.ncbi.nlm.nih.gov/pubmed/36273003 http://dx.doi.org/10.1038/s41467-022-34032-y |
work_keys_str_mv | AT lupoumberto proteinlanguagemodelstrainedonmultiplesequencealignmentslearnphylogeneticrelationships AT sgarbossadamiano proteinlanguagemodelstrainedonmultiplesequencealignmentslearnphylogeneticrelationships AT bitbolanneflorence proteinlanguagemodelstrainedonmultiplesequencealignmentslearnphylogeneticrelationships |
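
The abstract's central claim, that simple combinations of MSA Transformer's column attentions correlate strongly with Hamming distances between aligned sequences, can be illustrated with a minimal sketch. The Python snippet below is illustrative only and is not the authors' code: `col_attn_avg` is a random placeholder standing in for an M × M matrix assumed to have been obtained by combining the model's column attentions over layers, heads and alignment columns (the attention extraction itself, and the exact combination used in the paper, are not shown); only the Hamming-distance computation and the correlation check follow directly from the text.

```python
import numpy as np
from scipy.stats import pearsonr

def hamming_distance_matrix(msa):
    """Pairwise normalised Hamming distances between rows of an alignment.

    `msa` is a list of equal-length strings (one aligned sequence per row).
    """
    seqs = np.array([list(s) for s in msa])      # shape (M, L), one character per cell
    m = seqs.shape[0]
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            d = np.mean(seqs[i] != seqs[j])      # fraction of columns where the two rows differ
            dist[i, j] = dist[j, i] = d
    return dist

# Hypothetical inputs: a toy alignment, and a random placeholder for an M x M
# matrix assumed to come from combining MSA Transformer's column attentions
# over layers, heads and columns (extraction from the real model is not shown).
msa = ["MKLV-A", "MKIV-A", "MRLVQA", "MKLVQA"]
col_attn_avg = np.random.rand(len(msa), len(msa))

dist = hamming_distance_matrix(msa)
iu = np.triu_indices(len(msa), k=1)              # indices of distinct sequence pairs
r, _ = pearsonr(col_attn_avg[iu], dist[iu])
print(f"Pearson r between combined column attentions and Hamming distances: {r:.2f}")
```

With real column attentions in place of the random placeholder, the paper reports strong correlations of this kind across protein families; with the placeholder above, r is of course expected to be near zero.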