Efficient generative modeling of protein sequences using simple autoregressive models
Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly acc...
Main authors: | Trinquier, Jeanne; Uguzzoni, Guido; Pagnani, Andrea; Zamponi, Francesco; Weigt, Martin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2021 |
Subjects: | Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8490405/ https://www.ncbi.nlm.nih.gov/pubmed/34608136 http://dx.doi.org/10.1038/s41467-021-25756-4 |
_version_ | 1784578517355200512 |
---|---|
author | Trinquier, Jeanne; Uguzzoni, Guido; Pagnani, Andrea; Zamponi, Francesco; Weigt, Martin |
author_facet | Trinquier, Jeanne; Uguzzoni, Guido; Pagnani, Andrea; Zamponi, Francesco; Weigt, Martin |
author_sort | Trinquier, Jeanne |
collection | PubMed |
description | Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^−80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models. |
format | Online Article Text |
id | pubmed-8490405 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8490405 2021-10-07 Efficient generative modeling of protein sequences using simple autoregressive models Trinquier, Jeanne Uguzzoni, Guido Pagnani, Andrea Zamponi, Francesco Weigt, Martin Nat Commun Article Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^−80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models. 
Nature Publishing Group UK 2021-10-04 /pmc/articles/PMC8490405/ /pubmed/34608136 http://dx.doi.org/10.1038/s41467-021-25756-4 Text en © The Author(s) 2021, corrected publication 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Trinquier, Jeanne Uguzzoni, Guido Pagnani, Andrea Zamponi, Francesco Weigt, Martin Efficient generative modeling of protein sequences using simple autoregressive models |
title | Efficient generative modeling of protein sequences using simple autoregressive models |
title_full | Efficient generative modeling of protein sequences using simple autoregressive models |
title_fullStr | Efficient generative modeling of protein sequences using simple autoregressive models |
title_full_unstemmed | Efficient generative modeling of protein sequences using simple autoregressive models |
title_short | Efficient generative modeling of protein sequences using simple autoregressive models |
title_sort | efficient generative modeling of protein sequences using simple autoregressive models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8490405/ https://www.ncbi.nlm.nih.gov/pubmed/34608136 http://dx.doi.org/10.1038/s41467-021-25756-4 |
work_keys_str_mv | AT trinquierjeanne efficientgenerativemodelingofproteinsequencesusingsimpleautoregressivemodels AT uguzzoniguido efficientgenerativemodelingofproteinsequencesusingsimpleautoregressivemodels AT pagnaniandrea efficientgenerativemodelingofproteinsequencesusingsimpleautoregressivemodels AT zamponifrancesco efficientgenerativemodelingofproteinsequencesusingsimpleautoregressivemodels AT weigtmartin efficientgenerativemodelingofproteinsequencesusingsimpleautoregressivemodels |
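
The description field in this record states that an autoregressive model gives direct access to the probability of a sequence and, through its entropy, to the size of the functional sequence space. The Python sketch below illustrates both points under stated assumptions and is not the authors' implementation. Only the figures 10^68 and 10^−80 come from the record. The softmax parametrization with fields `h` and couplings `J`, the toy length and alphabet size, the random parameters, and every function name are hypothetical choices made for illustration. The first part checks the arithmetic: a fraction of 10^−80 holding 10^68 sequences implies about 10^148 sequences in total, which corresponds to a length near 112 if a 21-symbol alphabet (20 amino acids plus a gap) is assumed.

```python
import itertools
import math
import random

# Part 1: arithmetic implied by the figures quoted in the description.
LOG10_FUNCTIONAL = 68   # ca. 10^68 functional sequences (from the record)
LOG10_FRACTION = -80    # fraction 10^-80 of all sequences (from the record)

# functional = fraction * total, so log10(total) = 68 - (-80) = 148
LOG10_TOTAL = LOG10_FUNCTIONAL - LOG10_FRACTION


def implied_length(alphabet_size):
    """Length L for which alphabet_size**L matches the implied total."""
    return LOG10_TOTAL / math.log10(alphabet_size)


# Part 2: toy autoregressive model (hypothetical parametrization).
def make_toy_params(length, q, rng):
    """Random fields h[i][a] and couplings J[i][j][a][b] for j < i."""
    h = [[rng.gauss(0, 1) for _ in range(q)] for _ in range(length)]
    J = [[[[rng.gauss(0, 1) for _ in range(q)] for _ in range(q)]
          for _ in range(i)] for i in range(length)]
    return h, J


def conditional(params, i, prefix):
    """P(a_i | a_1..a_{i-1}) as a softmax over the q symbols."""
    h, J = params
    q = len(h[i])
    logits = [h[i][a] + sum(J[i][j][a][prefix[j]] for j in range(i))
              for a in range(q)]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def log_prob(params, seq):
    """Chain rule: log P(seq) = sum_i log P(a_i | a_<i)."""
    return sum(math.log(conditional(params, i, seq[:i])[a])
               for i, a in enumerate(seq))


def sample(params, rng):
    """Draw one sequence site by site, with no Markov chain Monte Carlo."""
    seq = []
    for i in range(len(params[0])):
        p = conditional(params, i, seq)
        seq.append(rng.choices(range(len(p)), weights=p)[0])
    return seq


def entropy_exact(params, q):
    """Entropy in nats by full enumeration; feasible only for toy sizes."""
    total = 0.0
    for seq in itertools.product(range(q), repeat=len(params[0])):
        lp = log_prob(params, list(seq))
        total -= math.exp(lp) * lp
    return total


def entropy_sampled(params, n_samples, rng):
    """Monte Carlo estimate: minus the mean log-probability of samples."""
    draws = (sample(params, rng) for _ in range(n_samples))
    return -sum(log_prob(params, s) for s in draws) / n_samples


rng = random.Random(0)
toy = make_toy_params(length=4, q=3, rng=rng)
S = entropy_exact(toy, q=3)
effective_sequences = math.exp(S)  # entropy-based size of sequence space
```

Because each conditional is normalized, the sequence probability is exact without computing a partition function, and `math.exp(S)` is the entropy-based count of effectively accessible sequences. For realistic lengths, enumeration is impossible and a sampled estimate such as `entropy_sampled` would take its place.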