
Generative power of a protein language model trained on multiple sequence alignments

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families....


Bibliographic Details
Main Authors: Sgarbossa, Damiano, Lupo, Umberto, Bitbol, Anne-Florence
Format: Online Article Text
Language: English
Published: eLife Sciences Publications, Ltd 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038667/
https://www.ncbi.nlm.nih.gov/pubmed/36734516
http://dx.doi.org/10.7554/eLife.79854
_version_ 1784912130864054272
author Sgarbossa, Damiano
Lupo, Umberto
Bitbol, Anne-Florence
author_facet Sgarbossa, Damiano
Lupo, Umberto
Bitbol, Anne-Florence
author_sort Sgarbossa, Damiano
collection PubMed
description Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
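
The description above outlines the paper's generation procedure: iteratively applying MSA Transformer's masked language modeling objective to an input MSA. As a purely illustrative aid, the following Python sketch shows the general shape of such an iterative masked-sampling loop. The mlm_logits stand-in, the mask fraction, the iteration count, and the temperature are assumptions made here for illustration, not the authors' published implementation; in practice, MSA Transformer's masked-token predictions would replace the toy stand-in.

import numpy as np

VOCAB_SIZE = 21   # 20 amino acids plus gap; illustrative choice
MASK = -1         # hypothetical mask token id for this sketch

def mlm_logits(msa_tokens, rng):
    # Toy stand-in returning random logits of shape (n_seqs, seq_len, VOCAB_SIZE).
    # In practice, a masked language model such as MSA Transformer would
    # provide these per-position predictions for the masked MSA.
    n_seqs, seq_len = msa_tokens.shape
    return rng.normal(size=(n_seqs, seq_len, VOCAB_SIZE))

def iterative_masked_generation(msa_tokens, mask_fraction=0.1, n_iterations=200,
                                temperature=1.0, seed=0):
    # msa_tokens: integer array of shape (n_seqs, seq_len) encoding an input MSA.
    # Repeatedly mask a random subset of positions in every sequence and refill
    # them by sampling from the model's predicted distributions.
    rng = np.random.default_rng(seed)
    msa = np.array(msa_tokens, copy=True)
    n_seqs, seq_len = msa.shape
    n_mask = max(1, int(round(mask_fraction * seq_len)))

    for _ in range(n_iterations):
        masked = msa.copy()
        # Choose positions to mask independently in each sequence of the MSA.
        positions = [rng.choice(seq_len, size=n_mask, replace=False)
                     for _ in range(n_seqs)]
        for i, pos in enumerate(positions):
            masked[i, pos] = MASK

        logits = mlm_logits(masked, rng)

        # Refill only the masked positions, sampling from softmax(logits / T).
        for i, pos in enumerate(positions):
            scaled = logits[i, pos] / temperature
            probs = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
            probs /= probs.sum(axis=-1, keepdims=True)
            msa[i, pos] = [rng.choice(VOCAB_SIZE, p=p) for p in probs]

    return msa

Usage sketch, under the same assumptions: calling iterative_masked_generation(np.zeros((10, 50), dtype=int)) returns a 10-sequence, 50-column array of sampled tokens; with real MSA Transformer predictions in place of the toy model, such a loop would instead yield candidate sequences resembling the protein family of the input MSA.
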
format Online
Article
Text
id pubmed-10038667
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher eLife Sciences Publications, Ltd
record_format MEDLINE/PubMed
spelling pubmed-10038667 2023-03-25 Generative power of a protein language model trained on multiple sequence alignments Sgarbossa, Damiano Lupo, Umberto Bitbol, Anne-Florence eLife Computational and Systems Biology Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design. eLife Sciences Publications, Ltd 2023-02-03 /pmc/articles/PMC10038667/ /pubmed/36734516 http://dx.doi.org/10.7554/eLife.79854 Text en © 2023, Sgarbossa et al https://creativecommons.org/licenses/by/4.0/ This article is distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use and redistribution provided that the original author and source are credited.
spellingShingle Computational and Systems Biology
Sgarbossa, Damiano
Lupo, Umberto
Bitbol, Anne-Florence
Generative power of a protein language model trained on multiple sequence alignments
title Generative power of a protein language model trained on multiple sequence alignments
title_full Generative power of a protein language model trained on multiple sequence alignments
title_fullStr Generative power of a protein language model trained on multiple sequence alignments
title_full_unstemmed Generative power of a protein language model trained on multiple sequence alignments
title_short Generative power of a protein language model trained on multiple sequence alignments
title_sort generative power of a protein language model trained on multiple sequence alignments
topic Computational and Systems Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10038667/
https://www.ncbi.nlm.nih.gov/pubmed/36734516
http://dx.doi.org/10.7554/eLife.79854
work_keys_str_mv AT sgarbossadamiano generativepowerofaproteinlanguagemodeltrainedonmultiplesequencealignments
AT lupoumberto generativepowerofaproteinlanguagemodeltrainedonmultiplesequencealignments
AT bitbolanneflorence generativepowerofaproteinlanguagemodeltrainedonmultiplesequencealignments