Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets

Each year, publicly available databases are updated with new compounds from different research institutions. Positive experimental outcomes are more likely to be reported; therefore, they account for a considerable fraction of these entries. Established publicly available databases...

Bibliographic Details
Main authors: Smajić, Aljoša; Rami, Iris; Sosnin, Sergey; Ecker, Gerhard F.
Format: Online Article Text
Language: English
Published: American Chemical Society 2023
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10445286/
https://www.ncbi.nlm.nih.gov/pubmed/37439496
http://dx.doi.org/10.1021/acs.chemrestox.3c00042
author Smajić, Aljoša
Rami, Iris
Sosnin, Sergey
Ecker, Gerhard F.
collection PubMed
description Each year, publicly available databases are updated with new compounds from different research institutions. Positive experimental outcomes are more likely to be reported; therefore, they account for a considerable fraction of these entries. Established publicly available databases such as ChEMBL allow researchers to use information without constrictions and create predictive tools for a broad spectrum of applications in the field of toxicology. Therefore, we investigated the distribution of positive and nonpositive entries within ChEMBL for a set of off-targets and its impact on the performance of classification models when applied to pharmaceutical industry data sets. Results indicate that models trained on publicly available data tend to overpredict positives, and models based on industry data sets predict negatives more often than those built using publicly available data sets. This is strengthened even further by the visualization of the prediction space for a set of 10,000 compounds, which makes it possible to identify regions in the chemical space where predictions converge. Finally, we highlight the utilization of these models for consensus modeling for potential adverse events prediction.
format Online
Article
Text
id pubmed-10445286
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Chemical Society
record_format MEDLINE/PubMed
spelling pubmed-10445286 2023-08-24
Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets
Smajić, Aljoša; Rami, Iris; Sosnin, Sergey; Ecker, Gerhard F.
Chem Res Toxicol
American Chemical Society 2023-07-13
/pmc/articles/PMC10445286/ /pubmed/37439496 http://dx.doi.org/10.1021/acs.chemrestox.3c00042
Text en © 2023 The Authors. Published by American Chemical Society. https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/).
title Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets