Large-Scale Modeling of Sparse Protein Kinase Activity Data

Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against dif...

Bibliographic Details
Main Authors:	Luukkonen, Sohvi; Meijer, Erik; Tricarico, Giovanni A.; Hofmans, Johan; Stouten, Pieter F. W.; van Westen, Gerard J. P.; Lenselink, Eelke B.
Format:	Online Article Text
Language:	English
Published:	American Chemical Society 2023
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10302492/ https://www.ncbi.nlm.nih.gov/pubmed/37294674 http://dx.doi.org/10.1021/acs.jcim.3c00132

_version_	1785065057494761472
author	Luukkonen, Sohvi Meijer, Erik Tricarico, Giovanni A. Hofmans, Johan Stouten, Pieter F. W. van Westen, Gerard J. P. Lenselink, Eelke B.
author_facet	Luukkonen, Sohvi Meijer, Erik Tricarico, Giovanni A. Hofmans, Johan Stouten, Pieter F. W. van Westen, Gerard J. P. Lenselink, Eelke B.
author_sort	Luukkonen, Sohvi
collection	PubMed
description	Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multitask machine learning models are expected to excel for these kinds of data sets because they can learn from implicit correlations between tasks (in this case activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train–test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on random split-based sets for all models, indicating poor generalizability of models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
format	Online Article Text
id	pubmed-10302492
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	American Chemical Society
record_format	MEDLINE/PubMed
spelling	pubmed-10302492 2023-06-29 Large-Scale Modeling of Sparse Protein Kinase Activity Data. J Chem Inf Model. American Chemical Society 2023-06-09 /pmc/articles/PMC10302492/ /pubmed/37294674 http://dx.doi.org/10.1021/acs.jcim.3c00132 Text en © 2023 The Authors. Published by American Chemical Society. https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained.
title	Large-Scale Modeling of Sparse Protein Kinase Activity Data
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10302492/ https://www.ncbi.nlm.nih.gov/pubmed/37294674 http://dx.doi.org/10.1021/acs.jcim.3c00132
