Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data
Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome o...
Main authors: | Posani, Lorenzo, Rizzato, Francesca, Monasson, Rémi, Cocco, Simona |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10645369/ https://www.ncbi.nlm.nih.gov/pubmed/37883593 http://dx.doi.org/10.1371/journal.pcbi.1011521 |
_version_ | 1785134739252838400 |
---|---|
author | Posani, Lorenzo Rizzato, Francesca Monasson, Rémi Cocco, Simona |
author_facet | Posani, Lorenzo Rizzato, Francesca Monasson, Rémi Cocco, Simona |
author_sort | Posani, Lorenzo |
collection | PubMed |
description | Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models. |
format | Online Article Text |
id | pubmed-10645369 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-10645369 2023-10-26 Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data Posani, Lorenzo Rizzato, Francesca Monasson, Rémi Cocco, Simona PLoS Comput Biol Research Article Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models. Public Library of Science 2023-10-26 /pmc/articles/PMC10645369/ /pubmed/37883593 http://dx.doi.org/10.1371/journal.pcbi.1011521 Text en © 2023 Posani et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Posani, Lorenzo Rizzato, Francesca Monasson, Rémi Cocco, Simona Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data |
title | Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data |
title_full | Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data |
title_fullStr | Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data |
title_full_unstemmed | Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data |
title_short | Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data |
title_sort | infer global, predict local: quantity-relevance trade-off in protein fitness predictions from sequence data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10645369/ https://www.ncbi.nlm.nih.gov/pubmed/37883593 http://dx.doi.org/10.1371/journal.pcbi.1011521 |
work_keys_str_mv | AT posanilorenzo inferglobalpredictlocalquantityrelevancetradeoffinproteinfitnesspredictionsfromsequencedata AT rizzatofrancesca inferglobalpredictlocalquantityrelevancetradeoffinproteinfitnesspredictionsfromsequencedata AT monassonremi inferglobalpredictlocalquantityrelevancetradeoffinproteinfitnesspredictionsfromsequencedata AT coccosimona inferglobalpredictlocalquantityrelevancetradeoffinproteinfitnesspredictionsfromsequencedata |