Estimation of the applicability domain of kernel-based machine learning models for virtual screening

BACKGROUND: The virtual screening of large compound databases is an important application of structure-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give relia...

Bibliographic Details
Main Authors:	Fechner, Nikolas; Jahn, Andreas; Hinselmann, Georg; Zell, Andreas
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851576/ https://www.ncbi.nlm.nih.gov/pubmed/20222949 http://dx.doi.org/10.1186/1758-2946-2-2

_version_	1782179875046555648
author	Fechner, Nikolas; Jahn, Andreas; Hinselmann, Georg; Zell, Andreas
author_facet	Fechner, Nikolas; Jahn, Andreas; Hinselmann, Georg; Zell, Andreas
author_sort	Fechner, Nikolas
collection	PubMed
description	BACKGROUND: The virtual screening of large compound databases is an important application of structure-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem that have been published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model. RESULTS: We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of the training of a kernel-based QSAR model using support vector regression and the ranking of a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model for the respective compound is quantitatively described using a score obtained by an applicability domain formulation. The suitability of the applicability domain estimation is evaluated by comparing the model performance on the subsets of the screening data sets obtained by different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of the chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set to apply the model successfully. A closer inspection reveals that the virtual screening performance of the model is considerably improved if half of the molecules, those with the lowest applicability scores, are omitted from the screening.
CONCLUSION: The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered as a drawback, because the results indicate that, in most cases, these omitted ligands would not be found by the model anyway.
format	Text
id	pubmed-2851576
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2851576 2010-04-09 Estimation of the applicability domain of kernel-based machine learning models for virtual screening Fechner, Nikolas Jahn, Andreas Hinselmann, Georg Zell, Andreas J Cheminform Research article BioMed Central 2010-03-11 /pmc/articles/PMC2851576/ /pubmed/20222949 http://dx.doi.org/10.1186/1758-2946-2-2 Text en Copyright ©2010 Fechner et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Estimation of the applicability domain of kernel-based machine learning models for virtual screening
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851576/ https://www.ncbi.nlm.nih.gov/pubmed/20222949 http://dx.doi.org/10.1186/1758-2946-2-2
