A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing

The knowledge of mixtures’ phase equilibria is crucial in nature and technical chemistry. Phase equilibria calculations of mixtures require activity coefficients. However, experimental data on activity coefficients are often limited due to the high cost of experiments. For an accurate and efficient...

Bibliographic Details
Main authors: Winter, Benedikt; Winter, Clemens; Schilling, Johannes; Bardow, André
Format: Online Article Text
Language: English
Published: RSC 2022
Subjects: Chemistry
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9721150/
https://www.ncbi.nlm.nih.gov/pubmed/36561987
http://dx.doi.org/10.1039/d2dd00058j
author Winter, Benedikt
Winter, Clemens
Schilling, Johannes
Bardow, André
collection PubMed
description The knowledge of mixtures’ phase equilibria is crucial in nature and technical chemistry. Phase equilibria calculations of mixtures require activity coefficients. However, experimental data on activity coefficients are often limited due to the high cost of experiments. For an accurate and efficient prediction of activity coefficients, machine learning approaches have been recently developed. However, current machine learning approaches still extrapolate poorly for activity coefficients of unknown molecules. In this work, we introduce a SMILES-to-properties-transformer (SPT), a natural language processing network, to predict binary limiting activity coefficients from SMILES codes. To overcome the limitations of available experimental data, we initially train our network on a large dataset of synthetic data sampled from COSMO-RS (10 million data points) and then fine-tune the model on experimental data (20 870 data points). This training strategy enables the SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient predictions such as COSMO-RS and UNIFAC(Dortmund), and improving on recent machine learning approaches.
format Online
Article
Text
id pubmed-9721150
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher RSC
record_format MEDLINE/PubMed
spelling pubmed-9721150 2022-12-20 A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing. Winter, Benedikt; Winter, Clemens; Schilling, Johannes; Bardow, André. Digit Discov. Chemistry. RSC 2022-09-29 /pmc/articles/PMC9721150/ /pubmed/36561987 http://dx.doi.org/10.1039/d2dd00058j Text en This journal is © The Royal Society of Chemistry https://creativecommons.org/licenses/by-nc/3.0/
title A smile is all you need: predicting limiting activity coefficients from SMILES with natural language processing
topic Chemistry