On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
Main Authors: | Handa, Koichi; Thomas, Morgan C.; Kageyama, Michiharu; Iijima, Takeshi; Bender, Andreas |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing, 2023 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10664602/ https://www.ncbi.nlm.nih.gov/pubmed/37990215 http://dx.doi.org/10.1186/s13321-023-00781-1 |
author | Handa, Koichi Thomas, Morgan C. Kageyama, Michiharu Iijima, Takeshi Bender, Andreas |
author_sort | Handa, Koichi |
collection | PubMed |
description | While a multitude of deep generative models have recently emerged, there exists no best practice for their practically relevant validation. On the one hand, novel de novo-generated molecules cannot be refuted by retrospective validation (so that this type of validation is biased); on the other hand, prospective validation is expensive and often biased by the human selection process. In this case study, we frame retrospective validation as the ability to mimic human drug design by answering the following question: Can a generative model trained on early-stage project compounds generate middle/late-stage compounds de novo? To this end, we used experimental data that contain the elapsed time of a synthetic expansion following hit identification from five public project datasets (where the time series was pre-processed to better reflect realistic synthetic expansions) and six in-house project datasets, and used REINVENT as a widely adopted RNN-based generative model. After splitting each dataset and training REINVENT on early-stage compounds, we found that rediscovery of middle/late-stage compounds was much higher in public projects (1.60%, 0.64%, and 0.21% of the top 100, 500, and 5000 scored generated compounds, respectively) than in in-house projects (0.00%, 0.03%, and 0.04%, respectively). Similarly, in public projects the average single-nearest-neighbour similarity between early- and middle/late-stage compounds was higher for active compounds than for inactive compounds; for in-house projects the converse was true, which makes rediscovery (if so desired) more difficult. We hence show that the generative model recovers very few middle/late-stage compounds from real-world drug discovery projects, highlighting the fundamental difference between purely algorithmic design and drug discovery as a real-world process. Evaluating de novo compound design approaches retrospectively appears, based on the current study, difficult or even impossible.
Scientific Contribution: This contribution illustrates aspects of evaluating the performance of generative models in a real-world setting that have not been extensively described previously, and that will hopefully contribute to their further development. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-023-00781-1. |
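The two evaluation metrics described in the abstract — rediscovery within the top-k scored de novo molecules, and average single-nearest-neighbour (SNN) Tanimoto similarity between compound sets — can be sketched as below. This is an illustrative sketch only, not the study's actual code: the function names are made up for this example, exact-match rediscovery assumes both sides use pre-canonicalised SMILES, and fingerprints are represented generically as sets of on-bits (the study would use standard cheminformatics fingerprints, e.g. via RDKit).

```python
def rediscovery_at_k(scored_generated, held_out, k):
    """Fraction of the top-k scored de novo molecules that exactly match a
    held-out (middle/late-stage) compound.

    scored_generated: list of (smiles, score) pairs
    held_out: set of SMILES strings
    Assumes SMILES on both sides are already canonicalised.
    """
    top_k = sorted(scored_generated, key=lambda x: x[1], reverse=True)[:k]
    hits = sum(1 for smi, _ in top_k if smi in held_out)
    return hits / k


def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def avg_snn_similarity(query_fps, ref_fps):
    """Average single-nearest-neighbour similarity: for each query fingerprint,
    take its best Tanimoto match in the reference set, then average over queries."""
    return sum(max(tanimoto(q, r) for r in ref_fps) for q in query_fps) / len(query_fps)
```

For example, with `scored_generated = [("CCO", 0.9), ("CCN", 0.8), ("CCC", 0.1)]` and `held_out = {"CCN"}`, `rediscovery_at_k(..., k=2)` returns 0.5, since one of the two top-scored molecules is a held-out compound.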
format | Online Article Text |
id | pubmed-10664602 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-10664602 2023-11-21 On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data. Handa, Koichi; Thomas, Morgan C.; Kageyama, Michiharu; Iijima, Takeshi; Bender, Andreas. J Cheminform, Research. Springer International Publishing, 2023-11-21. /pmc/articles/PMC10664602/ /pubmed/37990215 http://dx.doi.org/10.1186/s13321-023-00781-1. Text, en. © The Author(s) 2023. Open Access under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/); the Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10664602/ https://www.ncbi.nlm.nih.gov/pubmed/37990215 http://dx.doi.org/10.1186/s13321-023-00781-1 |