Estimation of the applicability domain of kernel-based machine learning models for virtual screening
BACKGROUND: The virtual screening of large compound databases is an important application of structural-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give relia...
Main Authors: | Fechner, Nikolas; Jahn, Andreas; Hinselmann, Georg; Zell, Andreas |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | Research article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851576/ https://www.ncbi.nlm.nih.gov/pubmed/20222949 http://dx.doi.org/10.1186/1758-2946-2-2 |
_version_ | 1782179875046555648 |
---|---|
author | Fechner, Nikolas; Jahn, Andreas; Hinselmann, Georg; Zell, Andreas |
author_sort | Fechner, Nikolas |
collection | PubMed |
description | BACKGROUND: The virtual screening of large compound databases is an important application of structural-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem that have been published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model. RESULTS: We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of the training of a kernel-based QSAR model using support vector regression and the ranking of a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model for the respective compound is quantitatively described using a score obtained by an applicability domain formulation. The suitability of the applicability domain estimation is evaluated by comparing the model performance on the subsets of the screening data sets obtained by different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of the chemspace, in which the model gives reliable predictions, from the part consisting of structures too dissimilar to the training set to apply the model successfully. A closer inspection reveals that the virtual screening performance of the model is considerably improved if half of the molecules, those with the lowest applicability scores, are omitted from the screening. CONCLUSION: The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered as a drawback, because the results indicate that, in most cases, these omitted ligands would not be found by the model anyway. |
format | Text |
id | pubmed-2851576 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2851576 2010-04-09 J Cheminform Research article BioMed Central 2010-03-11 /pmc/articles/PMC2851576/ /pubmed/20222949 http://dx.doi.org/10.1186/1758-2946-2-2 Text en Copyright ©2010 Fechner et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Estimation of the applicability domain of kernel-based machine learning models for virtual screening |
topic | Research article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2851576/ https://www.ncbi.nlm.nih.gov/pubmed/20222949 http://dx.doi.org/10.1186/1758-2946-2-2 |