Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models

A reliable and practical determination of a chemical species’ solubility in water continues to be examined using empirical observations and exhaustive experimental studies alone. Predictions of chemical solubility in water using data-driven algorithms can allow us to create a rationally designed, ef...

Bibliographic Details
Main Authors: Tayyebi, Arash; Alshami, Ali S; Rabiei, Zeinab; Yu, Xue; Ismail, Nadhem; Talukder, Musabbir Jahan; Power, Jason
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10583449/
https://www.ncbi.nlm.nih.gov/pubmed/37853492
http://dx.doi.org/10.1186/s13321-023-00752-6
collection PubMed
description A reliable and practical determination of a chemical species’ solubility in water continues to be examined using empirical observations and exhaustive experimental studies alone. Predictions of chemical solubility in water using data-driven algorithms can allow us to create a rationally designed, efficient, and cost-effective tool for next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies to adequately predict various species’ solubility using data for over 8400 compounds. Molecular descriptors, the most widely used method in previous studies, and the Morgan fingerprint, a circular-based hash of the molecules' structures, were applied to produce water solubility estimates. We trained all models on 80% of the total dataset using the Random Forest (RF) technique as the regressor and tested the prediction performance using the remaining 20%, resulting in coefficient of determination (R²) test values of 0.88 and 0.81 and root-mean-square error (RMSE) test values of 0.64 and 0.80 for the descriptor and circular fingerprint methods, respectively. We interpreted the produced ML models and reported the most effective features for aqueous solubility measures using Shapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, the ability to investigate molecular-level interactions, and compatibility with thermodynamic quantities made the fingerprint method a distinct model compared to other available computational tools. However, it is worth emphasizing that the physicochemical descriptor model outperformed the fingerprint model in predictive accuracy for the given test set. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-023-00752-6.
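For readers unfamiliar with the evaluation described in the abstract, the following is a minimal sketch of an 80/20 split and of the two reported test metrics, R² and RMSE. It is not the authors' code: the function names and the toy values are hypothetical, chosen only to illustrate the definitions, and no model is trained here.

```python
import math
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. predicted solubility values (illustration only).
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]

train, test = train_test_split(range(10))  # 8 training items, 2 test items
print(len(train), len(test))
print(r2_score(y_true, y_pred), rmse(y_true, y_pred))
```

With these definitions, a higher R² and a lower RMSE on the held-out 20% both indicate better predictions, which is the sense in which the abstract reports the descriptor model (0.88, 0.64) ahead of the fingerprint model (0.81, 0.80).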
id pubmed-10583449
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-10583449 2023-10-19 J Cheminform Research Springer International Publishing 2023-10-18 Text en © The Author(s) 2023
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
topic Research