Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data

Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome o...

Bibliographic Details
Main authors:	Posani, Lorenzo; Rizzato, Francesca; Monasson, Rémi; Cocco, Simona
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10645369/ https://www.ncbi.nlm.nih.gov/pubmed/37883593 http://dx.doi.org/10.1371/journal.pcbi.1011521

collection	PubMed
description	Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.
format	Online Article Text
id	pubmed-10645369
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10645369 2023-10-26 PLoS Comput Biol Research Article Public Library of Science 2023-10-26 /pmc/articles/PMC10645369/ /pubmed/37883593 http://dx.doi.org/10.1371/journal.pcbi.1011521 Text en © 2023 Posani et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10645369/ https://www.ncbi.nlm.nih.gov/pubmed/37883593 http://dx.doi.org/10.1371/journal.pcbi.1011521

Similar Items