
Conformal prediction under feedback covariate shift for biomolecular design

Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to pred...

Bibliographic Details
Main Authors: Fannjiang, Clara; Bates, Stephen; Angelopoulos, Anastasios N.; Listgarten, Jennifer; Jordan, Michael I.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2022
Subjects: Physical Sciences
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9618043/
https://www.ncbi.nlm.nih.gov/pubmed/36256807
http://dx.doi.org/10.1073/pnas.2204569119
author Fannjiang, Clara
Bates, Stephen
Angelopoulos, Anastasios N.
Listgarten, Jennifer
Jordan, Michael I.
collection PubMed
description Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model’s predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting—one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model’s error on the test data—that is, the designed sequences—has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.
format Online
Article
Text
id pubmed-9618043
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-9618043 2022-10-31 Conformal prediction under feedback covariate shift for biomolecular design Fannjiang, Clara; Bates, Stephen; Angelopoulos, Anastasios N.; Listgarten, Jennifer; Jordan, Michael I. Proc Natl Acad Sci U S A Physical Sciences
National Academy of Sciences 2022-10-18 2022-10-25 /pmc/articles/PMC9618043/ /pubmed/36256807 http://dx.doi.org/10.1073/pnas.2204569119 Text en Copyright © 2022 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
title Conformal prediction under feedback covariate shift for biomolecular design
topic Physical Sciences