Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data to avoid overfitting, which leads to low accuracy and poor transferability. In this work, we propose a strategy that leverages unlabelled data to learn accurate ML models for small labelled chemical reaction datasets. We focus on an old and prominent problem—classifying reactions into distinct families—and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from those of other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center, and we find that they are key to enabling the model to extract information from unlabelled data that aids the reaction classification task. The transfer-learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets, as well as models based on reaction fingerprints derived from masked language modelling. In addition to reaction classification, we test the effectiveness of the strategy on regression datasets. The learned GNN-based reaction fingerprints can also be used to navigate chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.
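The pretraining objective described in the abstract is a two-view contrastive loss: two augmented versions of the same reaction are pulled together in representation space while being pushed away from other reactions in the batch. The following is a minimal sketch of such an NT-Xent (SimCLR-style) objective, not the authors' implementation; the encoder, function names, and temperature value are illustrative assumptions.

```python
# Sketch of a two-view contrastive (NT-Xent-style) objective as described in
# the abstract. Illustrative only -- not the authors' code; the temperature
# and all names here are assumptions.
import torch
import torch.nn.functional as F


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two augmented views of the same reactions."""
    batch = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # unit-norm rows, shape (2B, d)
    sim = z @ z.t() / temperature                       # scaled cosine similarities, (2B, 2B)
    sim.fill_diagonal_(float("-inf"))                   # a view is not its own positive
    # The positive for row i is its other view: i + B for i < B, i - B otherwise;
    # every other row in the batch acts as a negative.
    idx = torch.arange(batch, device=z1.device)
    targets = torch.cat([idx + batch, idx])
    return F.cross_entropy(sim, targets)


# Usage sketch: encode two chemically consistent augmentations of each reaction
# (perturbations that leave the reaction center untouched) with a GNN encoder,
# minimise nt_xent_loss(gnn(view_a), gnn(view_b)) over unlabelled reactions,
# then fine-tune the pretrained encoder on the small labelled set. The encoder
# outputs double as reaction fingerprints for similarity queries, e.g. ranking
# a database by cosine similarity to a query reaction's embedding.
```

The temperature controls how strongly hard negatives are weighted; lowering it sharpens the distribution over negatives, which is a common tuning knob in this family of losses.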


Bibliographic Details
Main Authors: Wen, Mingjian; Blau, Samuel M.; Xie, Xiaowei; Dwaraknath, Shyam; Persson, Kristin A.
Format: Online Article Text
Language: English
Published: The Royal Society of Chemistry, 2022
Subjects: Chemistry
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8809395/
https://www.ncbi.nlm.nih.gov/pubmed/35222929
http://dx.doi.org/10.1039/d1sc06515g
Collection: PubMed (record pubmed-8809395)
Institution: National Center for Biotechnology Information
Record Format: MEDLINE/PubMed
Journal: Chem Sci
Published Online: 11 January 2022
License: This journal is © The Royal Society of Chemistry (https://creativecommons.org/licenses/by-nc/3.0/)