Balancing Data on Deep Learning-Based Proteochemometric Activity Classification
In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand–target activity prediction models. However, bioac...
Main authors: | Lopez-del Rio, Angela; Picart-Armada, Sergio; Perera-Lluna, Alexandre |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Chemical Society 2021 |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8594867/ https://www.ncbi.nlm.nih.gov/pubmed/33779173 http://dx.doi.org/10.1021/acs.jcim.1c00086 |
_version_ | 1784600075488460800 |
---|---|
author | Lopez-del Rio, Angela; Picart-Armada, Sergio; Perera-Lluna, Alexandre |
author_facet | Lopez-del Rio, Angela; Picart-Armada, Sergio; Perera-Lluna, Alexandre |
author_sort | Lopez-del Rio, Angela |
collection | PubMed |
description | In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand–target activity prediction models. However, bioactivity data sets used in proteochemometric modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target–compound activity classification models while controlling for the compound series bias through clustering. These strategies were (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering, and (4) semi_resampling. These schemas were evaluated in kinases, GPCRs, nuclear receptors, and proteases from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometric model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark. |
format | Online Article Text |
id | pubmed-8594867 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | American Chemical Society |
record_format | MEDLINE/PubMed |
spelling | pubmed-8594867 2021-11-19 Balancing Data on Deep Learning-Based Proteochemometric Activity Classification Lopez-del Rio, Angela; Picart-Armada, Sergio; Perera-Lluna, Alexandre J Chem Inf Model American Chemical Society 2021-03-29 2021-04-26 /pmc/articles/PMC8594867/ /pubmed/33779173 http://dx.doi.org/10.1021/acs.jcim.1c00086 Text en © 2021 American Chemical Society https://creativecommons.org/licenses/by/4.0/ Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/). |
title | Balancing Data on Deep Learning-Based Proteochemometric Activity Classification |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8594867/ https://www.ncbi.nlm.nih.gov/pubmed/33779173 http://dx.doi.org/10.1021/acs.jcim.1c00086 |