
Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets

The world is on the verge of a new industrial revolution, and language models are poised to play a pivotal role in this transformative era. Their ability to offer intelligent insights and forecasts has made them a valuable asset for businesses seeking a competitive advantage. The chemical industry, in particular, can benefit significantly from harnessing their power. Since as early as 2016, language models have been applied to tasks such as predicting reaction outcomes or retrosynthetic routes. While such models have demonstrated impressive abilities, the lack of publicly available data sets with universal coverage is often the limiting factor for achieving even higher accuracies. This makes it imperative for organizations to incorporate proprietary data sets into their model training processes to improve model performance. So far, however, these data sets frequently remain untapped as there are no established criteria for model customization. In this work, we report a successful methodology for retraining language models on reaction outcome prediction and single-step retrosynthesis tasks, using proprietary, nonpublic data sets. We report a considerable boost in accuracy by combining patent and proprietary data in a multidomain learning formulation. This exercise, inspired by a real-world use case, enables us to formulate guidelines that can be adopted in different corporate settings to customize chemical language models easily.
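The abstract does not include code, but as an illustrative sketch only (not the authors' implementation), the multidomain formulation it describes could look roughly like the following: public patent reactions are mixed with a smaller proprietary set, each example is tagged with its domain of origin, and the proprietary data are oversampled before fine-tuning. The file names, tag strings, and oversampling factor below are assumptions.

```python
# Illustrative sketch only -- not the authors' code. It mimics the idea of a
# multidomain training set: public patent reactions mixed with a smaller
# proprietary set, each example tagged with its domain of origin.
# File names, tag strings, and the oversampling factor are assumptions.
import random


def load_reactions(path):
    """Read one reaction SMILES per line (precursors>>product)."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def build_multidomain_set(patent_rxns, proprietary_rxns, oversample=5, seed=0):
    """Tag each reaction with its domain and oversample the proprietary data
    so the small in-house set is not drowned out by the patent corpus."""
    tagged = ["[patent] " + rxn for rxn in patent_rxns]
    tagged += ["[proprietary] " + rxn for rxn in proprietary_rxns] * oversample
    random.Random(seed).shuffle(tagged)
    return tagged


if __name__ == "__main__":
    patent = load_reactions("uspto_reactions.txt")          # hypothetical file
    in_house = load_reactions("proprietary_reactions.txt")  # hypothetical file
    training_set = build_multidomain_set(patent, in_house)
    # `training_set` would then be tokenized and used to fine-tune a
    # pretrained chemical seq2seq model for forward prediction or
    # single-step retrosynthesis.
```

The domain tags let a single model serve both data domains, while the oversampling counteracts the size imbalance between the patent corpus and the in-house set.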


Bibliographic Details
Main Authors: Toniato, Alessandra, Vaucher, Alain C., Lehmann, Marzena Maria, Luksch, Torsten, Schwaller, Philippe, Stenta, Marco, Laino, Teodoro
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10653079/
https://www.ncbi.nlm.nih.gov/pubmed/38027545
http://dx.doi.org/10.1021/acs.chemmater.3c01406
_version_ 1785147744198852608
author Toniato, Alessandra
Vaucher, Alain C.
Lehmann, Marzena Maria
Luksch, Torsten
Schwaller, Philippe
Stenta, Marco
Laino, Teodoro
author_facet Toniato, Alessandra
Vaucher, Alain C.
Lehmann, Marzena Maria
Luksch, Torsten
Schwaller, Philippe
Stenta, Marco
Laino, Teodoro
author_sort Toniato, Alessandra
collection PubMed
description The world is on the verge of a new industrial revolution, and language models are poised to play a pivotal role in this transformative era. Their ability to offer intelligent insights and forecasts has made them a valuable asset for businesses seeking a competitive advantage. The chemical industry, in particular, can benefit significantly from harnessing their power. Since as early as 2016, language models have been applied to tasks such as predicting reaction outcomes or retrosynthetic routes. While such models have demonstrated impressive abilities, the lack of publicly available data sets with universal coverage is often the limiting factor for achieving even higher accuracies. This makes it imperative for organizations to incorporate proprietary data sets into their model training processes to improve model performance. So far, however, these data sets frequently remain untapped as there are no established criteria for model customization. In this work, we report a successful methodology for retraining language models on reaction outcome prediction and single-step retrosynthesis tasks, using proprietary, nonpublic data sets. We report a considerable boost in accuracy by combining patent and proprietary data in a multidomain learning formulation. This exercise, inspired by a real-world use case, enables us to formulate guidelines that can be adopted in different corporate settings to customize chemical language models easily.
format Online
Article
Text
id pubmed-10653079
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-10653079 2023-11-16 Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets Toniato, Alessandra Vaucher, Alain C. Lehmann, Marzena Maria Luksch, Torsten Schwaller, Philippe Stenta, Marco Laino, Teodoro Chem Mater The world is on the verge of a new industrial revolution, and language models are poised to play a pivotal role in this transformative era. Their ability to offer intelligent insights and forecasts has made them a valuable asset for businesses seeking a competitive advantage. The chemical industry, in particular, can benefit significantly from harnessing their power. Since as early as 2016, language models have been applied to tasks such as predicting reaction outcomes or retrosynthetic routes. While such models have demonstrated impressive abilities, the lack of publicly available data sets with universal coverage is often the limiting factor for achieving even higher accuracies. This makes it imperative for organizations to incorporate proprietary data sets into their model training processes to improve model performance. So far, however, these data sets frequently remain untapped as there are no established criteria for model customization. In this work, we report a successful methodology for retraining language models on reaction outcome prediction and single-step retrosynthesis tasks, using proprietary, nonpublic data sets. We report a considerable boost in accuracy by combining patent and proprietary data in a multidomain learning formulation. This exercise, inspired by a real-world use case, enables us to formulate guidelines that can be adopted in different corporate settings to customize chemical language models easily. American Chemical Society 2023-10-27 /pmc/articles/PMC10653079/ /pubmed/38027545 http://dx.doi.org/10.1021/acs.chemmater.3c01406 Text en © 2023 The Authors and Syngenta. Published by American Chemical Society https://creativecommons.org/licenses/by-nc-nd/4.0/ Permits non-commercial access and re-use, provided that author attribution and integrity are maintained; but does not permit creation of adaptations or other derivative works (https://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Toniato, Alessandra
Vaucher, Alain C.
Lehmann, Marzena Maria
Luksch, Torsten
Schwaller, Philippe
Stenta, Marco
Laino, Teodoro
Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets
title Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets
title_full Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets
title_fullStr Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets
title_full_unstemmed Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets
title_short Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets
title_sort fast customization of chemical language models to out-of-distribution data sets
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10653079/
https://www.ncbi.nlm.nih.gov/pubmed/38027545
http://dx.doi.org/10.1021/acs.chemmater.3c01406
work_keys_str_mv AT toniatoalessandra fastcustomizationofchemicallanguagemodelstooutofdistributiondatasets
AT vaucheralainc fastcustomizationofchemicallanguagemodelstooutofdistributiondatasets
AT lehmannmarzenamaria fastcustomizationofchemicallanguagemodelstooutofdistributiondatasets
AT lukschtorsten fastcustomizationofchemicallanguagemodelstooutofdistributiondatasets
AT schwallerphilippe fastcustomizationofchemicallanguagemodelstooutofdistributiondatasets
AT stentamarco fastcustomizationofchemicallanguagemodelstooutofdistributiondatasets
AT lainoteodoro fastcustomizationofchemicallanguagemodelstooutofdistributiondatasets