MolData, a molecular benchmark for disease and target based machine learning

Bibliographic Details
Main Authors: Keshavarzi Arshadi, Arash; Salem, Milad; Firouzbakht, Arash; Yuan, Jiann Shiun
Format: Online Article Text
Language: English
Published: Springer International Publishing 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8899453/
https://www.ncbi.nlm.nih.gov/pubmed/35255958
http://dx.doi.org/10.1186/s13321-022-00590-y
_version_ 1784663916836552704
author Keshavarzi Arshadi, Arash
Salem, Milad
Firouzbakht, Arash
Yuan, Jiann Shiun
collection PubMed
description Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, bringing the ability both to learn abstract features and to discover complex molecular patterns from molecular data. Since biological and chemical knowledge are necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing depositories such as PubChem or ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions which can hinder their usage by the machine learning community. In this work, a comprehensive disease and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-022-00590-y.
format Online
Article
Text
id pubmed-8899453
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-8899453 2022-03-07 MolData, a molecular benchmark for disease and target based machine learning Keshavarzi Arshadi, Arash Salem, Milad Firouzbakht, Arash Yuan, Jiann Shiun J Cheminform Research Article Springer International Publishing 2022-03-07 /pmc/articles/PMC8899453/ /pubmed/35255958 http://dx.doi.org/10.1186/s13321-022-00590-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title MolData, a molecular benchmark for disease and target based machine learning
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8899453/
https://www.ncbi.nlm.nih.gov/pubmed/35255958
http://dx.doi.org/10.1186/s13321-022-00590-y