On the use of real-world datasets for reaction yield prediction

The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly avail...

Bibliographic Details
Main authors: Saebi, Mandana; Nan, Bozhao; Herr, John E.; Wahlers, Jessica; Guo, Zhichun; Zurański, Andrzej M.; Kogej, Thierry; Norrby, Per-Ola; Doyle, Abigail G.; Chawla, Nitesh V.; Wiest, Olaf
Format: Online Article Text
Language: English
Published: The Royal Society of Chemistry 2023
Subjects: Chemistry
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10189898/
https://www.ncbi.nlm.nih.gov/pubmed/37206399
http://dx.doi.org/10.1039/d2sc06041h
author Saebi, Mandana
Nan, Bozhao
Herr, John E.
Wahlers, Jessica
Guo, Zhichun
Zurański, Andrzej M.
Kogej, Thierry
Norrby, Per-Ola
Doyle, Abigail G.
Chawla, Nitesh V.
Wiest, Olaf
collection PubMed
description The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki–Miyaura and Buchwald–Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions.
format Online
Article
Text
id pubmed-10189898
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher The Royal Society of Chemistry
record_format MEDLINE/PubMed
spelling pubmed-10189898 2023-05-18. On the use of real-world datasets for reaction yield prediction. Chem Sci. Chemistry. The Royal Society of Chemistry, 2023-03-13. /pmc/articles/PMC10189898/ /pubmed/37206399 http://dx.doi.org/10.1039/d2sc06041h Text en. This journal is © The Royal Society of Chemistry. https://creativecommons.org/licenses/by-nc/3.0/
title On the use of real-world datasets for reaction yield prediction
topic Chemistry
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10189898/
https://www.ncbi.nlm.nih.gov/pubmed/37206399
http://dx.doi.org/10.1039/d2sc06041h