Estimation of the applicability domain of kernel-based machine learning models for virtual screening

BACKGROUND: The virtual screening of large compound databases is an important application of structure-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine-learning-based QSAR models, which rely on a specific training set, to give relia...

Bibliographic Details
Main authors: Fechner, Nikolas, Jahn, Andreas, Hinselmann, Georg, Zell, Andreas
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851576/
https://www.ncbi.nlm.nih.gov/pubmed/20222949
http://dx.doi.org/10.1186/1758-2946-2-2
author Fechner, Nikolas
Jahn, Andreas
Hinselmann, Georg
Zell, Andreas
author_sort Fechner, Nikolas
collection PubMed
description BACKGROUND: The virtual screening of large compound databases is an important application of structure-activity relationship models. Because these data sets are structurally highly diverse, machine-learning-based QSAR models, which rely on a specific training set, cannot give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem published so far mostly use vectorial descriptor representations to define this domain of applicability. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.
RESULTS: We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of training a kernel-based QSAR model using support vector regression and ranking a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model to the respective compound is described quantitatively by a score obtained from an applicability domain formulation. The suitability of the applicability domain estimation is evaluated by comparing the model performance on the subsets of the screening data sets obtained with different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of the chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set for the model to be applied successfully. A closer inspection reveals that the virtual screening performance of the model improves considerably if the half of the molecules with the lowest applicability scores is omitted from the screening.
CONCLUSION: The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered a drawback, because the results indicate that, in most cases, these omitted ligands would not be found by the model anyway.
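The screening procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the three applicability domain formulations, so the score used here (mean kernel similarity to the k most similar training compounds) is an assumption chosen for illustration, not the authors' method. The function names, the toy kernel matrix, and the predicted activities are likewise hypothetical; in practice the predictions would come from a support vector regression model trained on a structured kernel.

```python
from statistics import mean


def applicability_scores(k_screen_train, k_neighbors=3):
    """Score each screening compound by its mean kernel similarity to the
    k most similar training compounds (an assumed, illustrative formulation)."""
    scores = []
    for row in k_screen_train:
        top = sorted(row, reverse=True)[:k_neighbors]
        scores.append(mean(top))
    return scores


def screen_within_domain(predicted, scores, keep_fraction=0.5):
    """Drop the compounds with the lowest applicability scores, then rank
    the remaining compounds by predicted activity, highest first."""
    n_keep = max(1, round(len(scores) * keep_fraction))
    # Indices of the n_keep compounds with the highest applicability scores.
    kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Rank only the kept compounds by their predicted activity.
    return sorted(kept, key=lambda i: predicted[i], reverse=True)


# Toy data (hypothetical): 4 screening compounds x 3 training compounds,
# with normalized kernel values in [0, 1].
K = [
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3],
    [0.6, 0.9, 0.5],
    [0.1, 0.2, 0.1],
]
predicted_activity = [6.1, 8.4, 7.2, 5.0]

scores = applicability_scores(K, k_neighbors=2)
ranking = screen_within_domain(predicted_activity, scores, keep_fraction=0.5)
print(ranking)
```

In this toy case, compound 1 has the highest predicted activity but is dropped because it is dissimilar to every training compound, which mirrors the abstract's point that predictions outside the applicability domain should not be trusted.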
format Text
id pubmed-2851576
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-28515762010-04-09 Estimation of the applicability domain of kernel-based machine learning models for virtual screening Fechner, Nikolas Jahn, Andreas Hinselmann, Georg Zell, Andreas J Cheminform Research article BioMed Central 2010-03-11 /pmc/articles/PMC2851576/ /pubmed/20222949 http://dx.doi.org/10.1186/1758-2946-2-2 Text en Copyright ©2010 Fechner et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Estimation of the applicability domain of kernel-based machine learning models for virtual screening
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851576/
https://www.ncbi.nlm.nih.gov/pubmed/20222949
http://dx.doi.org/10.1186/1758-2946-2-2