Harnessing Data Augmentation and Normalization Preprocessing to Improve the Performance of Chemical Reaction Predictions of Data-Driven Model

As a template-free, data-driven methodology, the molecular transformer model provides an alternative by which to predict the outcome of chemical reactions and design the route of the retrosynthetic plane in the field of organic synthesis and polymer chemistry. However, in consideration of the small...

Bibliographic Details
Main Authors: Zhang, Boyu; Lin, Jiaping; Du, Lei; Zhang, Liangshun
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10180765/
https://www.ncbi.nlm.nih.gov/pubmed/37177370
http://dx.doi.org/10.3390/polym15092224
author Zhang, Boyu
Lin, Jiaping
Du, Lei
Zhang, Liangshun
collection PubMed
description As a template-free, data-driven methodology, the molecular transformer model provides an alternative by which to predict the outcome of chemical reactions and design the route of the retrosynthetic plane in the field of organic synthesis and polymer chemistry. However, in consideration of the small datasets of chemical reactions, the data-driven model suffers from the difficulty of low accuracy in the prediction tasks of chemical reactions. In this contribution, we integrate the molecular transformer model with the strategies of data augmentation and normalization preprocessing to accomplish the three tasks of chemical reactions, including the forward predictions of chemical reactions, and single-step retrosynthetic predictions with and without the reaction classes. It is clearly demonstrated that the prediction accuracy of the molecular transformer model can be significantly raised by the use of proposed strategies for the three tasks of chemical reactions. Notably, after the introduction of the 40-level data augmentation and normalization preprocessing, the top-1 accuracy of the forward prediction increases markedly from 71.6% to 84.2% and the top-1 accuracy of the single-step retrosynthetic prediction with additional reaction class increases from 53.2% to 63.4%. Furthermore, it is found that the superior performance of the data-driven model originates from the correction of the grammatical errors of the SMILES strings, especially for the case of the reaction classes with small datasets.
format Online
Article
Text
id pubmed-10180765
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10180765 2023-05-13 Polymers (Basel) Article MDPI 2023-05-08 /pmc/articles/PMC10180765/ /pubmed/37177370 http://dx.doi.org/10.3390/polym15092224 Text en
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Harnessing Data Augmentation and Normalization Preprocessing to Improve the Performance of Chemical Reaction Predictions of Data-Driven Model
topic Article