
Conformal prediction under feedback covariate shift for biomolecular design

Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to pred...


Bibliographic Details
Main authors:	Fannjiang, Clara; Bates, Stephen; Angelopoulos, Anastasios N.; Listgarten, Jennifer; Jordan, Michael I.
Format:	Online Article Text
Language:	English
Published:	National Academy of Sciences 2022
Subjects:	Physical Sciences
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9618043/ https://www.ncbi.nlm.nih.gov/pubmed/36256807 http://dx.doi.org/10.1073/pnas.2204569119

_version_	1784820966762741760
author	Fannjiang, Clara; Bates, Stephen; Angelopoulos, Anastasios N.; Listgarten, Jennifer; Jordan, Michael I.
author_sort	Fannjiang, Clara
collection	PubMed
description	Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model’s predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting—one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model’s error on the test data—that is, the designed sequences—has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.
format	Online Article Text
id	pubmed-9618043
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	National Academy of Sciences
record_format	MEDLINE/PubMed
spelling	pubmed-9618043 2022-10-31 Proc Natl Acad Sci U S A National Academy of Sciences 2022-10-18 2022-10-25 /pmc/articles/PMC9618043/ /pubmed/36256807 http://dx.doi.org/10.1073/pnas.2204569119 Text en Copyright © 2022 the Author(s). Published by PNAS. https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title	Conformal prediction under feedback covariate shift for biomolecular design
topic	Physical Sciences
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9618043/ https://www.ncbi.nlm.nih.gov/pubmed/36256807 http://dx.doi.org/10.1073/pnas.2204569119


Similar items