Conformal prediction under feedback covariate shift for biomolecular design
Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to pred...
Main authors: | Fannjiang, Clara; Bates, Stephen; Angelopoulos, Anastasios N.; Listgarten, Jennifer; Jordan, Michael I. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences 2022 |
Subjects: | Physical Sciences |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9618043/ https://www.ncbi.nlm.nih.gov/pubmed/36256807 http://dx.doi.org/10.1073/pnas.2204569119 |
_version_ | 1784820966762741760 |
---|---|
author | Fannjiang, Clara; Bates, Stephen; Angelopoulos, Anastasios N.; Listgarten, Jennifer; Jordan, Michael I. |
author_sort | Fannjiang, Clara |
collection | PubMed |
description | Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model’s predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting—one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model’s error on the test data—that is, the designed sequences—has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty. |
format | Online Article Text |
id | pubmed-9618043 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-9618043 2022-10-31 Conformal prediction under feedback covariate shift for biomolecular design Fannjiang, Clara; Bates, Stephen; Angelopoulos, Anastasios N.; Listgarten, Jennifer; Jordan, Michael I. Proc Natl Acad Sci U S A Physical Sciences National Academy of Sciences 2022-10-18 2022-10-25 /pmc/articles/PMC9618043/ /pubmed/36256807 http://dx.doi.org/10.1073/pnas.2204569119 Text en Copyright © 2022 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Physical Sciences Fannjiang, Clara Bates, Stephen Angelopoulos, Anastasios N. Listgarten, Jennifer Jordan, Michael I. Conformal prediction under feedback covariate shift for biomolecular design |
title | Conformal prediction under feedback covariate shift for biomolecular design |
topic | Physical Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9618043/ https://www.ncbi.nlm.nih.gov/pubmed/36256807 http://dx.doi.org/10.1073/pnas.2204569119 |