
On the sparsity of fitness functions and implications for learning

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model’s interpretable parameters—sequence length, alphabet size, and assumed interactions between sequence positions—on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.
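The abstract's core argument lends itself to a short numerical illustration. The sketch below is our own, not code from the paper: it plants a fitness function over binary sequences that is sparse in the Walsh-Hadamard (epistatic) basis, then recovers it from a small random subset of measurements with LASSO, a standard compressed-sensing recovery algorithm. The sequence length, sparsity, sample count, and penalty `alpha` are arbitrary toy choices.

```python
# Minimal sketch (our illustration, not the authors' code) of recovering a
# sparse fitness function by compressed sensing. Assumes a binary alphabet,
# where the epistatic basis is the Walsh-Hadamard transform.
import numpy as np
from scipy.linalg import hadamard
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

L = 8          # sequence length; the binary sequence space has 2**L sequences
N = 2 ** L     # size of the full sequence space (256)
S = 10         # number of nonzero epistatic coefficients (the sparsity)
n = 100        # number of measured sequences, n << N

# Orthonormal Walsh-Hadamard basis; each column is one epistatic interaction term.
Phi = hadamard(N) / np.sqrt(N)

# Plant a sparse ground-truth coefficient vector. (In the GNK model, the support
# would instead be determined by the assumed neighborhoods of interacting positions.)
beta_true = np.zeros(N)
support = rng.choice(N, size=S, replace=False)
beta_true[support] = rng.normal(size=S)
f = Phi @ beta_true   # fitness value of every sequence in the space

# Measure a random subset of sequences and recover the coefficients with LASSO.
measured = rng.choice(N, size=n, replace=False)
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100_000)
lasso.fit(Phi[measured], f[measured])

rel_err = np.linalg.norm(lasso.coef_ - beta_true) / np.linalg.norm(beta_true)
print(f"relative recovery error with {n}/{N} measurements: {rel_err:.2e}")
```

With n comfortably above the compressed-sensing sample requirement, the planted coefficients are recovered to high accuracy; pushing n down toward S makes recovery break down. Predicting where that threshold sits for realistic fitness functions, via the sparsity of GNK samples, is the question the paper addresses.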

Bibliographic Details

Main Authors: Brookes, David H.; Aghazadeh, Amirali; Listgarten, Jennifer
Format: Online Article (Text)
Language: English
Published: National Academy of Sciences, 2021
Subjects: Biological Sciences
Online Access:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8740588/
https://www.ncbi.nlm.nih.gov/pubmed/34937698
http://dx.doi.org/10.1073/pnas.2109649118
Journal: Proc Natl Acad Sci U S A (Biological Sciences)
Collection: PubMed, National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Published Online: 2021-12-22; available in PMC: 2022-01-04
License: Copyright © 2021 the Author(s). Published by PNAS. This article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND): https://creativecommons.org/licenses/by-nc-nd/4.0/