Unbiasing Retrosynthesis Language Models with Disconnection Prompts

Bibliographic Details
Main Authors: Thakkar, Amol, Vaucher, Alain C., Byekwaso, Andrea, Schwaller, Philippe, Toniato, Alessandra, Laino, Teodoro
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10390024/
https://www.ncbi.nlm.nih.gov/pubmed/37529205
http://dx.doi.org/10.1021/acscentsci.3c00372
_version_ 1785082390084845568
author Thakkar, Amol
Vaucher, Alain C.
Byekwaso, Andrea
Schwaller, Philippe
Toniato, Alessandra
Laino, Teodoro
author_facet Thakkar, Amol
Vaucher, Alain C.
Byekwaso, Andrea
Schwaller, Philippe
Toniato, Alessandra
Laino, Teodoro
author_sort Thakkar, Amol
collection PubMed
description Data-driven approaches to retrosynthesis are limited in user interaction, diversity of their predictions, and recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in natural language processing to the task of chemical language modeling. We show that by using a prompt describing the disconnection site in a molecule we can steer the model to propose a broader set of precursors, thereby overcoming training data biases in retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a disconnection prompt empowers chemists by giving them greater control over the disconnection predictions, which results in more diverse and creative recommendations. In addition, in place of a human-in-the-loop strategy, we propose a two-stage schema consisting of automatic identification of disconnection sites, followed by prediction of reactant sets, thereby achieving a considerable improvement in class diversity compared with the baseline. The approach is effective in mitigating prediction biases derived from training data. This provides a wider variety of usable building blocks and improves the end user’s digital experience. We demonstrate its application to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is critical.
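The disconnection-prompt idea described in the abstract can be sketched in a few lines: the target molecule's SMILES string is annotated at the atoms where the chemist (or an automatic first-stage model) wants the bond broken, and the annotated string is what gets fed to the retrosynthesis language model. The `!` tag and the regex-based tokenizer below are illustrative assumptions for this sketch, not necessarily the paper's exact tagging scheme.

```python
# Minimal sketch of a disconnection prompt: tag the atoms at the intended
# disconnection site in a SMILES string before passing it to a retrosynthesis
# model. The "!" marker and tokenizer are hypothetical, for illustration only.
import re

# Rough SMILES tokenizer: bracket atoms, two-letter elements, then single chars.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[B-IK-Zb-ik-z0-9=#\-\+\\/\(\)\.])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)

def add_disconnection_prompt(smiles: str, tagged_atoms: set[int]) -> str:
    """Append a tag ('!') after the atom tokens at the intended disconnection
    site, steering the model toward breaking the bond between them."""
    atom_idx = 0
    out = []
    for tok in tokenize(smiles):
        out.append(tok)
        # Count only atom tokens (letters or bracket atoms), not bonds/rings.
        if tok[0].isalpha() or tok.startswith("["):
            if atom_idx in tagged_atoms:
                out.append("!")
            atom_idx += 1
    return "".join(out)

# Tag the carbonyl carbon and nitrogen of an amide (atom indices 2 and 4),
# suggesting an amide-bond disconnection: CCC!(=O)N!C
print(add_disconnection_prompt("CCC(=O)NC", {2, 4}))
```

In the two-stage schema the abstract describes, `tagged_atoms` would come from an automatic disconnection-site model rather than from the user; the prompted string is the same either way.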
format Online
Article
Text
id pubmed-10390024
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-10390024 2023-08-01 Unbiasing Retrosynthesis Language Models with Disconnection Prompts Thakkar, Amol; Vaucher, Alain C.; Byekwaso, Andrea; Schwaller, Philippe; Toniato, Alessandra; Laino, Teodoro. ACS Cent Sci. American Chemical Society 2023-07-05 /pmc/articles/PMC10390024/ /pubmed/37529205 http://dx.doi.org/10.1021/acscentsci.3c00372 Text en © 2023 The Authors.
Published by American Chemical Society. https://creativecommons.org/licenses/by/4.0/ (CC BY 4.0: permits the broadest form of re-use, including for commercial purposes, provided that author attribution and integrity are maintained).
spellingShingle Thakkar, Amol
Vaucher, Alain C.
Byekwaso, Andrea
Schwaller, Philippe
Toniato, Alessandra
Laino, Teodoro
Unbiasing Retrosynthesis Language Models with Disconnection Prompts
title Unbiasing Retrosynthesis Language Models with Disconnection Prompts
title_full Unbiasing Retrosynthesis Language Models with Disconnection Prompts
title_fullStr Unbiasing Retrosynthesis Language Models with Disconnection Prompts
title_full_unstemmed Unbiasing Retrosynthesis Language Models with Disconnection Prompts
title_short Unbiasing Retrosynthesis Language Models with Disconnection Prompts
title_sort unbiasing retrosynthesis language models with disconnection prompts
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10390024/
https://www.ncbi.nlm.nih.gov/pubmed/37529205
http://dx.doi.org/10.1021/acscentsci.3c00372
work_keys_str_mv AT thakkaramol unbiasingretrosynthesislanguagemodelswithdisconnectionprompts
AT vaucheralainc unbiasingretrosynthesislanguagemodelswithdisconnectionprompts
AT byekwasoandrea unbiasingretrosynthesislanguagemodelswithdisconnectionprompts
AT schwallerphilippe unbiasingretrosynthesislanguagemodelswithdisconnectionprompts
AT toniatoalessandra unbiasingretrosynthesislanguagemodelswithdisconnectionprompts
AT lainoteodoro unbiasingretrosynthesislanguagemodelswithdisconnectionprompts