
Efficient generative modeling of protein sequences using simple autoregressive models

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly acc...


Bibliographic Details
Main authors: Trinquier, Jeanne, Uguzzoni, Guido, Pagnani, Andrea, Zamponi, Francesco, Weigt, Martin
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8490405/
https://www.ncbi.nlm.nih.gov/pubmed/34608136
http://dx.doi.org/10.1038/s41467-021-25756-4
_version_ 1784578517355200512
author Trinquier, Jeanne
Uguzzoni, Guido
Pagnani, Andrea
Zamponi, Francesco
Weigt, Martin
author_facet Trinquier, Jeanne
Uguzzoni, Guido
Pagnani, Andrea
Zamponi, Francesco
Weigt, Martin
author_sort Trinquier, Jeanne
collection PubMed
description Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
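The size estimate quoted in the description can be checked with a few lines of arithmetic. The sketch below is illustrative and is not the authors' code: it converts a model entropy S (in nats) into an effective number of sequences exp(S), and compares a family of about 10^68 sequences against all q^L sequences of the same length. The alignment length L = 112 and the alphabet size q = 21 (20 amino acids plus a gap symbol) are assumptions made here, not values stated in this record, and the function names are hypothetical.

```python
import math


def entropy_to_log10_size(entropy_nats: float) -> float:
    """Return log10 of the effective number of sequences, exp(S), for entropy S in nats."""
    return entropy_nats / math.log(10)


def log10_fraction_of_sequence_space(log10_family_size: float,
                                     length: int,
                                     alphabet_size: int) -> float:
    """Return log10 of (family size / number of all sequences of the given length)."""
    log10_total = length * math.log10(alphabet_size)  # log10(q^L)
    return log10_family_size - log10_total


# Assumed illustrative values (not taken from the record): L = 112, q = 21.
ASSUMED_LENGTH = 112
ASSUMED_ALPHABET = 21

# An entropy of 68 * ln(10) nats corresponds to about 10^68 sequences.
log10_size = entropy_to_log10_size(68 * math.log(10))
log10_fraction = log10_fraction_of_sequence_space(
    log10_size, ASSUMED_LENGTH, ASSUMED_ALPHABET
)
print(f"log10(family size) = {log10_size:.1f}")
print(f"log10(fraction of sequence space) = {log10_fraction:.1f}")
```

Under these assumed values the fraction comes out close to 10^-80, which matches the order of magnitude given in the description.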
format Online
Article
Text
id pubmed-8490405
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-8490405 2021-10-07 Efficient generative modeling of protein sequences using simple autoregressive models Trinquier, Jeanne Uguzzoni, Guido Pagnani, Andrea Zamponi, Francesco Weigt, Martin Nat Commun Article Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
Nature Publishing Group UK 2021-10-04 /pmc/articles/PMC8490405/ /pubmed/34608136 http://dx.doi.org/10.1038/s41467-021-25756-4 Text en © The Author(s) 2021, corrected publication 2022 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Article
Trinquier, Jeanne
Uguzzoni, Guido
Pagnani, Andrea
Zamponi, Francesco
Weigt, Martin
Efficient generative modeling of protein sequences using simple autoregressive models
title Efficient generative modeling of protein sequences using simple autoregressive models
title_full Efficient generative modeling of protein sequences using simple autoregressive models
title_fullStr Efficient generative modeling of protein sequences using simple autoregressive models
title_full_unstemmed Efficient generative modeling of protein sequences using simple autoregressive models
title_short Efficient generative modeling of protein sequences using simple autoregressive models
title_sort efficient generative modeling of protein sequences using simple autoregressive models
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8490405/
https://www.ncbi.nlm.nih.gov/pubmed/34608136
http://dx.doi.org/10.1038/s41467-021-25756-4