On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data

Bibliographic Details
Main Authors: Handa, Koichi, Thomas, Morgan C., Kageyama, Michiharu, Iijima, Takeshi, Bender, Andreas
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10664602/
https://www.ncbi.nlm.nih.gov/pubmed/37990215
http://dx.doi.org/10.1186/s13321-023-00781-1
_version_ 1785148762478346240
author Handa, Koichi
Thomas, Morgan C.
Kageyama, Michiharu
Iijima, Takeshi
Bender, Andreas
author_facet Handa, Koichi
Thomas, Morgan C.
Kageyama, Michiharu
Iijima, Takeshi
Bender, Andreas
author_sort Handa, Koichi
collection PubMed
description While a multitude of deep generative models have recently emerged, there exists no best practice for their practically relevant validation. On the one hand, novel de novo-generated molecules cannot be refuted by retrospective validation (so that this type of validation is biased); on the other hand, prospective validation is expensive and often biased by the human selection process. In this case study, we frame retrospective validation as the ability to mimic human drug design, by answering the following question: Can a generative model trained on early-stage project compounds generate middle/late-stage compounds de novo? To this end, we used experimental data that contains the elapsed time of a synthetic expansion following hit identification from five public (where the time series was pre-processed to better reflect realistic synthetic expansions) and six in-house project datasets, and used REINVENT as a widely adopted RNN-based generative model. After splitting the dataset and training REINVENT on early-stage compounds, we found that rediscovery of middle/late-stage compounds was much higher in public projects (at 1.60%, 0.64%, and 0.21% of the top 100, 500, and 5000 scored generated compounds) than in in-house projects (where the values were 0.00%, 0.03%, and 0.04%, respectively). Similarly, average single nearest neighbour similarity between early- and middle/late-stage compounds in public projects was higher between active compounds than inactive compounds; however, for in-house projects the converse was true, which makes rediscovery (if so desired) more difficult. We hence show that the generative model recovers very few middle/late-stage compounds from real-world drug discovery projects, highlighting the fundamental difference between purely algorithmic design and drug discovery as a real-world process. Evaluating de novo compound design approaches appears, based on the current study, difficult or even impossible to do retrospectively.
Scientific Contribution This contribution hence illustrates aspects of evaluating the performance of generative models in a real-world setting which have not been extensively described previously and which we hope will contribute to their further development. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-023-00781-1.
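The rediscovery metric reported in the description above — the share of the top-k scored de novo molecules that exactly match held-out middle/late-stage project compounds — can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name and toy SMILES strings are hypothetical, and molecules are assumed to be pre-canonicalised SMILES (in practice a cheminformatics toolkit such as RDKit would handle canonicalisation and the nearest-neighbour similarity analysis).

```python
def rediscovery_rate(ranked_generated, held_out, k):
    """Fraction of the top-k generated molecules (best-scored first)
    that exactly match a compound in the held-out set."""
    held_out = set(held_out)
    top_k = ranked_generated[:k]
    hits = sum(1 for smi in top_k if smi in held_out)
    return hits / k

# Toy data: generated molecules ranked by model score, best first.
generated = ["CCO", "c1ccccc1", "CCN", "CC(=O)O", "CCC"]
# Hypothetical held-out middle/late-stage compounds.
middle_late = {"CCN", "CC(=O)O"}

print(rediscovery_rate(generated, middle_late, 5))  # -> 0.4
```

With this definition, the study's headline numbers (e.g. 1.60% of the top 100 for public projects vs. 0.00% for in-house projects) are simply `rediscovery_rate` evaluated at k = 100, 500, and 5000.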
format Online
Article
Text
id pubmed-10664602
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-10664602 2023-11-21 On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data Handa, Koichi Thomas, Morgan C. Kageyama, Michiharu Iijima, Takeshi Bender, Andreas J Cheminform Research Springer International Publishing 2023-11-21 /pmc/articles/PMC10664602/ /pubmed/37990215 http://dx.doi.org/10.1186/s13321-023-00781-1 Text en © The Author(s) 2023. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Handa, Koichi
Thomas, Morgan C.
Kageyama, Michiharu
Iijima, Takeshi
Bender, Andreas
On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
title On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
title_full On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
title_fullStr On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
title_full_unstemmed On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
title_short On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
title_sort on the difficulty of validating molecular generative models realistically: a case study on public and proprietary data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10664602/
https://www.ncbi.nlm.nih.gov/pubmed/37990215
http://dx.doi.org/10.1186/s13321-023-00781-1
work_keys_str_mv AT handakoichi onthedifficultyofvalidatingmoleculargenerativemodelsrealisticallyacasestudyonpublicandproprietarydata
AT thomasmorganc onthedifficultyofvalidatingmoleculargenerativemodelsrealisticallyacasestudyonpublicandproprietarydata
AT kageyamamichiharu onthedifficultyofvalidatingmoleculargenerativemodelsrealisticallyacasestudyonpublicandproprietarydata
AT iijimatakeshi onthedifficultyofvalidatingmoleculargenerativemodelsrealisticallyacasestudyonpublicandproprietarydata
AT benderandreas onthedifficultyofvalidatingmoleculargenerativemodelsrealisticallyacasestudyonpublicandproprietarydata