
Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data

Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly-available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.

Bibliographic Details
Main Authors: Morger, Andrea, Garcia de Lomana, Marina, Norinder, Ulf, Svensson, Fredrik, Kirchmair, Johannes, Mathea, Miriam, Volkamer, Andrea
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9068909/
https://www.ncbi.nlm.nih.gov/pubmed/35508546
http://dx.doi.org/10.1038/s41598-022-09309-3
_version_ 1784700319641370624
author Morger, Andrea
Garcia de Lomana, Marina
Norinder, Ulf
Svensson, Fredrik
Kirchmair, Johannes
Mathea, Miriam
Volkamer, Andrea
author_facet Morger, Andrea
Garcia de Lomana, Marina
Norinder, Ulf
Svensson, Fredrik
Kirchmair, Johannes
Mathea, Miriam
Volkamer, Andrea
author_sort Morger, Andrea
collection PubMed
description Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly-available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
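To make the recalibration idea in the description above concrete, the sketch below shows a minimal class-conditional (Mondrian) inductive conformal classifier in Python, assuming scikit-learn and NumPy. It is an illustration of the general technique, not the authors' implementation: the class name InductiveConformalClassifier, the RandomForestClassifier choice, and the significance level epsilon = 0.2 are hypothetical stand-ins. What it demonstrates is the point made in the abstract: the calibration set can be swapped for newer data (recalibration) without retraining the underlying model.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

class InductiveConformalClassifier:
    """Illustrative class-conditional (Mondrian) inductive conformal classifier."""

    def __init__(self, model):
        self.model = model
        self.calib_scores = {}  # per-class nonconformity scores from the calibration set

    def fit(self, X_train, y_train):
        # Train the underlying point-prediction model once.
        self.model.fit(X_train, y_train)
        return self

    def calibrate(self, X_calib, y_calib):
        # Nonconformity score: 1 - predicted probability of the true class.
        # Calling this again with more recent data "updates the calibration set"
        # without touching the trained model.
        proba = self.model.predict_proba(X_calib)
        for idx, cls in enumerate(self.model.classes_):
            mask = np.asarray(y_calib) == cls
            self.calib_scores[cls] = 1.0 - proba[mask, idx]
        return self

    def predict_set(self, X, epsilon=0.2):
        # p-value for label y: (#{calibration scores >= test score} + 1) / (n + 1).
        # If calibration and test data are exchangeable, the prediction set
        # {y : p(y) > epsilon} is wrong at most a fraction epsilon of the time;
        # drift breaks that guarantee, and recalibration restores it.
        proba = self.model.predict_proba(X)
        prediction_sets = []
        for row in proba:
            labels = []
            for idx, cls in enumerate(self.model.classes_):
                score = 1.0 - row[idx]
                cal = self.calib_scores[cls]
                p_value = (np.sum(cal >= score) + 1) / (len(cal) + 1)
                if p_value > epsilon:
                    labels.append(cls)
            prediction_sets.append(labels)  # may be empty, single, or "inconclusive" (both)
        return prediction_sets

# Hypothetical usage: after data drift, re-run calibrate() with recent data
# to restore validity, at the possible cost of more inconclusive (multi-label) sets.
# cp = InductiveConformalClassifier(RandomForestClassifier()).fit(X_train, y_train)
# cp.calibrate(X_calib_original, y_calib_original)   # original calibration
# cp.calibrate(X_calib_recent, y_calib_recent)       # recalibration, no retraining
# prediction_sets = cp.predict_set(X_holdout, epsilon=0.2)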
format Online
Article
Text
id pubmed-9068909
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9068909 2022-05-05 Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data Morger, Andrea Garcia de Lomana, Marina Norinder, Ulf Svensson, Fredrik Kirchmair, Johannes Mathea, Miriam Volkamer, Andrea Sci Rep Article Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly-available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models. Nature Publishing Group UK 2022-05-04 /pmc/articles/PMC9068909/ /pubmed/35508546 http://dx.doi.org/10.1038/s41598-022-09309-3 Text © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle Article
Morger, Andrea
Garcia de Lomana, Marina
Norinder, Ulf
Svensson, Fredrik
Kirchmair, Johannes
Mathea, Miriam
Volkamer, Andrea
Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
title Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
title_full Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
title_fullStr Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
title_full_unstemmed Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
title_short Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
title_sort studying and mitigating the effects of data drifts on ml model performance at the example of chemical toxicity data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9068909/
https://www.ncbi.nlm.nih.gov/pubmed/35508546
http://dx.doi.org/10.1038/s41598-022-09309-3
work_keys_str_mv AT morgerandrea studyingandmitigatingtheeffectsofdatadriftsonmlmodelperformanceattheexampleofchemicaltoxicitydata
AT garciadelomanamarina studyingandmitigatingtheeffectsofdatadriftsonmlmodelperformanceattheexampleofchemicaltoxicitydata
AT norinderulf studyingandmitigatingtheeffectsofdatadriftsonmlmodelperformanceattheexampleofchemicaltoxicitydata
AT svenssonfredrik studyingandmitigatingtheeffectsofdatadriftsonmlmodelperformanceattheexampleofchemicaltoxicitydata
AT kirchmairjohannes studyingandmitigatingtheeffectsofdatadriftsonmlmodelperformanceattheexampleofchemicaltoxicitydata
AT matheamiriam studyingandmitigatingtheeffectsofdatadriftsonmlmodelperformanceattheexampleofchemicaltoxicitydata
AT volkamerandrea studyingandmitigatingtheeffectsofdatadriftsonmlmodelperformanceattheexampleofchemicaltoxicitydata