Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain

Computer Assisted Synthesis Planning (CASP) has gained considerable interest as of late. Herein we investigate a template-based retrosynthetic planning tool, trained on a variety of datasets consisting of up to 17.5 million reactions. We demonstrate that models trained on datasets such as internal E...

Bibliographic Details
Main authors: Thakkar, Amol; Kogej, Thierry; Reymond, Jean-Louis; Engkvist, Ola; Bjerrum, Esben Jannik
Format: Online Article Text
Language: English
Published: Royal Society of Chemistry 2019
Subjects: Chemistry
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7012039/
https://www.ncbi.nlm.nih.gov/pubmed/32110367
http://dx.doi.org/10.1039/c9sc04944d
author Thakkar, Amol
Kogej, Thierry
Reymond, Jean-Louis
Engkvist, Ola
Bjerrum, Esben Jannik
collection PubMed
description Computer Assisted Synthesis Planning (CASP) has gained considerable interest as of late. Herein we investigate a template-based retrosynthetic planning tool, trained on a variety of datasets consisting of up to 17.5 million reactions. We demonstrate that models trained on datasets such as internal Electronic Laboratory Notebooks (ELN), and the publicly available United States Patent Office (USPTO) extracts, are sufficient for the prediction of full synthetic routes to compounds of interest in medicinal chemistry. As such we have assessed the models on 1731 compounds from 41 virtual libraries for which experimental results were known. Furthermore, we show that accuracy is a misleading metric for assessment of the policy network, and propose that the number of successfully applied templates, in conjunction with the overall ability to generate full synthetic routes be examined instead. To this end we found that the specificity of the templates comes at the cost of generalizability, and overall model performance. This is supplemented by a comparison of the underlying datasets and their corresponding models.
format Online
Article
Text
id pubmed-7012039
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Royal Society of Chemistry
record_format MEDLINE/PubMed
spelling pubmed-7012039 2020-02-27 Chem Sci Chemistry Royal Society of Chemistry 2019-11-05 /pmc/articles/PMC7012039/ /pubmed/32110367 http://dx.doi.org/10.1039/c9sc04944d Text en This journal is © The Royal Society of Chemistry 2020 http://creativecommons.org/licenses/by/3.0/ This article is freely available. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence (CC BY 3.0)
title Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain
topic Chemistry