Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data


Bibliographic Details
Main authors: Posani, Lorenzo, Rizzato, Francesca, Monasson, Rémi, Cocco, Simona
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10645369/
https://www.ncbi.nlm.nih.gov/pubmed/37883593
http://dx.doi.org/10.1371/journal.pcbi.1011521
author Posani, Lorenzo
Rizzato, Francesca
Monasson, Rémi
Cocco, Simona
collection PubMed
description Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.
format Online
Article
Text
id pubmed-10645369
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10645369 2023-10-26 Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data. Posani, Lorenzo; Rizzato, Francesca; Monasson, Rémi; Cocco, Simona. PLoS Comput Biol. Research Article. Public Library of Science 2023-10-26 /pmc/articles/PMC10645369/ /pubmed/37883593 http://dx.doi.org/10.1371/journal.pcbi.1011521 Text en © 2023 Posani et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10645369/
https://www.ncbi.nlm.nih.gov/pubmed/37883593
http://dx.doi.org/10.1371/journal.pcbi.1011521