Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models
A reliable and practical determination of a chemical species’ solubility in water continues to be examined using empirical observations and exhaustive experimental studies alone. Predictions of chemical solubility in water using data-driven algorithms can allow us to create a rationally designed, ef...
Main Authors: | Tayyebi, Arash, Alshami, Ali S, Rabiei, Zeinab, Yu, Xue, Ismail, Nadhem, Talukder, Musabbir Jahan, Power, Jason |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2023 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10583449/ https://www.ncbi.nlm.nih.gov/pubmed/37853492 http://dx.doi.org/10.1186/s13321-023-00752-6 |
_version_ | 1785122555304083456 |
---|---|
author | Tayyebi, Arash Alshami, Ali S Rabiei, Zeinab Yu, Xue Ismail, Nadhem Talukder, Musabbir Jahan Power, Jason |
author_sort | Tayyebi, Arash |
collection | PubMed |
description | A reliable and practical determination of a chemical species’ solubility in water continues to be examined using empirical observations and exhaustive experimental studies alone. Predictions of chemical solubility in water using data-driven algorithms can allow us to create a rationally designed, efficient, and cost-effective tool for next-generation materials and chemical formulations. We present results from two machine learning (ML) modeling studies to adequately predict various species’ solubility using data for over 8400 compounds. Molecular descriptors, the most used method in previous studies, and Morgan fingerprints, a circular-based hash of the molecules' structures, were applied to produce water solubility estimates. We trained all models on 80% of the total dataset using the random forest (RF) technique as the regressor and tested the prediction performance using the remaining 20%, resulting in coefficient of determination (R²) test values of 0.88 and 0.81 and root-mean-square error (RMSE) test values of 0.64 and 0.80 for the descriptor and circular fingerprint methods, respectively. We interpreted the produced ML models and reported the most effective features for aqueous solubility measures using Shapley Additive exPlanations (SHAP) and thermodynamic analysis. Low error, the ability to investigate molecular-level interactions, and compatibility with thermodynamic quantities made the fingerprint method a distinct model compared to other available computational tools. However, it is worth emphasizing that the physicochemical descriptor model outperformed the fingerprint model in predictive accuracy on the given test set. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-023-00752-6. |
format | Online Article Text |
id | pubmed-10583449 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-10583449 2023-10-19 Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models Tayyebi, Arash Alshami, Ali S Rabiei, Zeinab Yu, Xue Ismail, Nadhem Talukder, Musabbir Jahan Power, Jason J Cheminform Research Springer International Publishing 2023-10-18 /pmc/articles/PMC10583449/ /pubmed/37853492 http://dx.doi.org/10.1186/s13321-023-00752-6 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Tayyebi, Arash Alshami, Ali S Rabiei, Zeinab Yu, Xue Ismail, Nadhem Talukder, Musabbir Jahan Power, Jason Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models |
title | Prediction of organic compound aqueous solubility using machine learning: a comparison study of descriptor-based and fingerprints-based models |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10583449/ https://www.ncbi.nlm.nih.gov/pubmed/37853492 http://dx.doi.org/10.1186/s13321-023-00752-6 |
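
The abstract in this record reports an 80%/20% train/test split and scores each model by the coefficient of determination (R²) and the root-mean-square error (RMSE) on the held-out 20%. The following is a minimal sketch of that evaluation protocol using only the Python standard library. It is not the authors' code: the function names, the fixed seed, and the toy values are assumptions made for illustration, and the random forest regressor and the descriptor or fingerprint featurization are deliberately left out.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and return (train, test).

    `test_fraction=0.2` mirrors the 80/20 split named in the abstract;
    the seed is an arbitrary choice made here for reproducibility.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values invented for illustration only; they are NOT data from the paper.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]

if __name__ == "__main__":
    train, test = train_test_split(range(10))
    print(len(train), len(test))
    print(round(rmse(y_true, y_pred), 3), round(r2_score(y_true, y_pred), 3))
```

In a full pipeline, the predictions passed to `rmse` and `r2_score` would come from a regressor fitted on the training portion only, so that both scores reflect performance on compounds the model has not seen.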