
Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level

Bibliographic Details

Main Authors: Bai, Renren, Zhang, Chengyun, Wang, Ling, Yao, Chuansheng, Ge, Jiamin, Duan, Hongliang
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7287934/
https://www.ncbi.nlm.nih.gov/pubmed/32438572
http://dx.doi.org/10.3390/molecules25102357
_version_ 1783545164832178176
author Bai, Renren
Zhang, Chengyun
Wang, Ling
Yao, Chuansheng
Ge, Jiamin
Duan, Hongliang
author_facet Bai, Renren
Zhang, Chengyun
Wang, Ling
Yao, Chuansheng
Ge, Jiamin
Duan, Hongliang
author_sort Bai, Renren
collection PubMed
description Effective computational prediction of complex or novel molecule syntheses can greatly help organic and medicinal chemistry. Retrosynthetic analysis is a method employed by chemists to predict synthetic routes to target compounds: the target compounds are incrementally converted into simpler compounds until the starting compounds are commercially available. However, predictions based on small chemical datasets often result in low accuracy due to an insufficient number of samples. To address this limitation, we introduced transfer learning to retrosynthetic analysis. Transfer learning is a machine learning approach that trains a model on one task and then applies the model to a related but different task; this approach can mitigate the problem of scarce data. The large, unclassified USPTO-380K dataset was first used to pretrain the models so that they gained basic chemical knowledge, such as the chirality of compounds, reaction types and the SMILES representation of chemical structures. Both USPTO-380K and USPTO-50K (the latter also used by Liu et al.) were originally derived from Lowe's patent-mining work. Liu et al. further processed these data and divided the reaction examples into 10 categories, but we did not. Subsequently, the acquired knowledge was transferred to the classified USPTO-50K small dataset for continued training and retrosynthetic reaction tests, and the accuracy of the pretrained models was compared with that of models trained without pretraining. The transfer-learning concept was combined with the sequence-to-sequence (seq2seq) or Transformer model for prediction and verification. The seq2seq and Transformer models, both of which are based on an encoder-decoder architecture, were originally constructed for language-translation tasks. The two algorithms translate the SMILES representations of reactants into the SMILES representations of products, also taking into account other relevant chemical information (chirality, reaction types and conditions). The results demonstrated that the accuracy of retrosynthetic analysis by the seq2seq and Transformer models was significantly improved after pretraining. The top-1 accuracy (the rate at which the first prediction matches the actual result) of the Transformer transfer-learning model increased from 52.4% to 60.7%, a marked improvement in prediction power. The model's top-20 accuracy (the rate at which the top 20 predictions contain the actual result) was 88.9%, which represents fairly good prediction in retrosynthetic analysis. In summary, this study shows that transferring learning between models working with different chemical datasets is feasible. The introduction of transfer learning significantly improved prediction accuracy and was especially helpful for small-dataset-based reaction prediction and retrosynthetic analysis.
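The description above outlines a two-stage workflow (pretrain an encoder-decoder model on the large unclassified USPTO-380K set, then continue training on the small classified USPTO-50K set) without showing how such a pipeline is typically wired together. The sketch below is a minimal, self-contained illustration of that idea in PyTorch; the vocabulary size, layer counts, checkpoint file name (pretrained_uspto380k.pt) and random stand-in batches are illustrative assumptions, not the authors' actual code, data or hyperparameters.

# Minimal sketch of the pretrain-then-fine-tune idea with an encoder-decoder
# Transformer operating on SMILES token-ID tensors (PyTorch).
# All names and numbers below are placeholders, not the authors' settings.
import torch
import torch.nn as nn

VOCAB_SIZE = 128   # hypothetical size of the SMILES token vocabulary
D_MODEL = 256

class SmilesTransformer(nn.Module):
    """Encoder-decoder model mapping one SMILES token sequence to another."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            dim_feedforward=512, batch_first=True)
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_ids, tgt_ids):
        # A real setup would also pass causal and padding masks; omitted here.
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        hidden = self.transformer(src, tgt)
        return self.out(hidden)          # (batch, tgt_len, vocab) logits

def train_step(model, optimizer, src_ids, tgt_ids):
    """One teacher-forced training step with cross-entropy on shifted targets."""
    logits = model(src_ids, tgt_ids[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), tgt_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = SmilesTransformer()

# Stage 1 (pretraining on the large unclassified set) would have produced a
# checkpoint; the file name here is a placeholder for such a checkpoint.
# model.load_state_dict(torch.load("pretrained_uspto380k.pt"))

# Stage 2: continue training ("fine-tuning") on the small classified set,
# simulated here with random token batches in place of tokenized USPTO-50K.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
src = torch.randint(0, VOCAB_SIZE, (8, 40))   # stand-in for input SMILES tokens
tgt = torch.randint(0, VOCAB_SIZE, (8, 40))   # stand-in for output SMILES tokens
print(train_step(model, optimizer, src, tgt))

In the paper's framing, top-1 and top-20 accuracy would then be obtained by checking whether the reference SMILES appears as the first, or among the first twenty, decoded candidates for each test reaction.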
format Online
Article
Text
id pubmed-7287934
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7287934 2020-06-15 Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level Bai, Renren Zhang, Chengyun Wang, Ling Yao, Chuansheng Ge, Jiamin Duan, Hongliang Molecules Article MDPI 2020-05-19 /pmc/articles/PMC7287934/ /pubmed/32438572 http://dx.doi.org/10.3390/molecules25102357 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Bai, Renren
Zhang, Chengyun
Wang, Ling
Yao, Chuansheng
Ge, Jiamin
Duan, Hongliang
Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level
title Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level
title_full Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level
title_fullStr Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level
title_full_unstemmed Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level
title_short Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level
title_sort transfer learning: making retrosynthetic predictions based on a small chemical reaction dataset scale to a new level
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7287934/
https://www.ncbi.nlm.nih.gov/pubmed/32438572
http://dx.doi.org/10.3390/molecules25102357
work_keys_str_mv AT bairenren transferlearningmakingretrosyntheticpredictionsbasedonasmallchemicalreactiondatasetscaletoanewlevel
AT zhangchengyun transferlearningmakingretrosyntheticpredictionsbasedonasmallchemicalreactiondatasetscaletoanewlevel
AT wangling transferlearningmakingretrosyntheticpredictionsbasedonasmallchemicalreactiondatasetscaletoanewlevel
AT yaochuansheng transferlearningmakingretrosyntheticpredictionsbasedonasmallchemicalreactiondatasetscaletoanewlevel
AT gejiamin transferlearningmakingretrosyntheticpredictionsbasedonasmallchemicalreactiondatasetscaletoanewlevel
AT duanhongliang transferlearningmakingretrosyntheticpredictionsbasedonasmallchemicalreactiondatasetscaletoanewlevel