Enhancing diversity in language based models for single-step retrosynthesis

Over the past four years, several research groups have demonstrated that combining domain-specific language representations with recent NLP architectures can accelerate innovation in a wide range of scientific fields. Chemistry is a prime example. Among the various chemical challenges addressed with language models, retrosynthesis exhibits some of the most distinctive successes and limitations. Single-step retrosynthesis, the task of identifying reactions able to decompose a complex molecule into simpler structures, can be cast as a translation problem in which a text-based representation of the target molecule is converted into a sequence of possible precursors. A common issue is a lack of diversity in the proposed disconnection strategies: the suggested precursors typically fall into the same reaction family, which limits the exploration of the chemical space. We present a retrosynthesis Transformer model that increases the diversity of the predictions by prepending a classification token to the language representation of the target molecule. At inference, these prompt tokens allow us to steer the model towards different kinds of disconnection strategies. We show that the diversity of the predictions improves consistently, which enables recursive synthesis tools to circumvent dead ends and, consequently, to suggest synthesis pathways for more complex molecules.
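
To make the prompt-token idea from the abstract concrete, the sketch below shows how a disconnection-class token could be prepended to the SMILES representation of a target molecule before it is passed to a sequence-to-sequence model. This is a minimal illustration under stated assumptions: the tokenizer regex, the `<RXN_n>` token naming, and the `build_source_sequence` helper are illustrative choices, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the published implementation): prepend a
# disconnection-class prompt token to a tokenized SMILES string before it is
# fed to a sequence-to-sequence retrosynthesis model.
import re

# Regex commonly used to split SMILES strings into chemically meaningful tokens.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into individual tokens."""
    return SMILES_TOKEN_PATTERN.findall(smiles)

def build_source_sequence(smiles: str, disconnection_class: int) -> list[str]:
    """Prepend a hypothetical class prompt token, e.g. '<RXN_3>', to the target tokens."""
    prompt_token = f"<RXN_{disconnection_class}>"  # assumed token naming, for illustration only
    return [prompt_token] + tokenize_smiles(smiles)

# At inference, sweeping the prompt token would steer the model towards
# different disconnection strategies for the same target molecule.
if __name__ == "__main__":
    target = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here only as an example target
    for disconnection_class in range(1, 4):
        print(build_source_sequence(target, disconnection_class))
```

Looping over several prompt tokens at inference time yields one set of precursor predictions per disconnection class, which is what drives the increased diversity described in the abstract.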

Bibliographic Details
Main Authors: Toniato, Alessandra; Vaucher, Alain C.; Schwaller, Philippe; Laino, Teodoro
Format: Online Article Text
Language: English
Journal: Digit Discov
Published: RSC, 2023-02-16
Subjects: Chemistry
Collection: PubMed (National Center for Biotechnology Information), record pubmed-10087060, MEDLINE/PubMed format
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10087060/
https://www.ncbi.nlm.nih.gov/pubmed/37065677
http://dx.doi.org/10.1039/d2dd00110a
License: This journal is © The Royal Society of Chemistry (https://creativecommons.org/licenses/by/3.0/)