Machine Learning for Fast, Quantum Mechanics-Based Approximation of Drug Lipophilicity

Lipophilicity, as measured by the partition coefficient between octanol and water (log P), is a key parameter in early drug discovery research. However, measuring log P experimentally is difficult for specific compounds and log P ranges. The resulting lack of reliable experimental...

Bibliographic Details
Main authors: Isert, Clemens; Kromann, Jimmy C.; Stiefl, Nikolaus; Schneider, Gisbert; Lewis, Richard A.
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9850743/
https://www.ncbi.nlm.nih.gov/pubmed/36687099
http://dx.doi.org/10.1021/acsomega.2c05607
_version_ 1784872250361511936
author Isert, Clemens
Kromann, Jimmy C.
Stiefl, Nikolaus
Schneider, Gisbert
Lewis, Richard A.
author_facet Isert, Clemens
Kromann, Jimmy C.
Stiefl, Nikolaus
Schneider, Gisbert
Lewis, Richard A.
author_sort Isert, Clemens
collection PubMed
description Lipophilicity, as measured by the partition coefficient between octanol and water (log P), is a key parameter in early drug discovery research. However, measuring log P experimentally is difficult for specific compounds and log P ranges. The resulting lack of reliable experimental data impedes development of accurate in silico models for such compounds. In certain discovery projects at Novartis focused on such compounds, a quantum mechanics (QM)-based tool for log P estimation has emerged as a valuable supplement to experimental measurements and as a preferred alternative to existing empirical models. However, this QM-based approach incurs a substantial computational cost, limiting its applicability to small series and prohibiting quick, interactive ideation. This work explores a set of machine learning models (Random Forest, Lasso, XGBoost, Chemprop, and Chemprop3D) to learn calculated log P values on both a public data set and an in-house data set to obtain a computationally affordable, QM-based estimation of drug lipophilicity. The message-passing neural network model Chemprop emerged as the best performing model with mean absolute errors of 0.44 and 0.34 log units for scaffold split test sets of the public and in-house data sets, respectively. Analysis of learning curves suggests that a further decrease in the test set error can be achieved by increasing the training set size. While models directly trained on experimental data perform better at approximating experimentally determined log P values than models trained on calculated values, we discuss the potential advantages of using calculated log P values going beyond the limits of experimental quantitation. We analyze the impact of the data set splitting strategy and gain insights into model failure modes. Potential use cases for the presented models include pre-screening of large compound collections and prioritization of compounds for full QM calculations.
format Online
Article
Text
id pubmed-9850743
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-9850743 2023-01-20 Machine Learning for Fast, Quantum Mechanics-Based Approximation of Drug Lipophilicity Isert, Clemens Kromann, Jimmy C. Stiefl, Nikolaus Schneider, Gisbert Lewis, Richard A. ACS Omega American Chemical Society 2023-01-04 /pmc/articles/PMC9850743/ /pubmed/36687099 http://dx.doi.org/10.1021/acsomega.2c05607 Text en © 2023 The Authors. Published by American Chemical Society https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Isert, Clemens
Kromann, Jimmy C.
Stiefl, Nikolaus
Schneider, Gisbert
Lewis, Richard A.
Machine Learning for Fast, Quantum Mechanics-Based Approximation of Drug Lipophilicity
title Machine Learning for Fast, Quantum Mechanics-Based Approximation of Drug Lipophilicity
title_sort machine learning for fast, quantum mechanics-based approximation of drug lipophilicity
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9850743/
https://www.ncbi.nlm.nih.gov/pubmed/36687099
http://dx.doi.org/10.1021/acsomega.2c05607