MolData, a molecular benchmark for disease and target based machine learning
Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, infusing both the capabilities of learning abstract features and discovering complex molecular patterns via learning from molecular data. Since biological and chemical knowledge are necess...
Main Authors: | Keshavarzi Arshadi, Arash; Salem, Milad; Firouzbakht, Arash; Yuan, Jiann Shiun |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8899453/ https://www.ncbi.nlm.nih.gov/pubmed/35255958 http://dx.doi.org/10.1186/s13321-022-00590-y |
_version_ | 1784663916836552704 |
---|---|
author | Keshavarzi Arshadi, Arash; Salem, Milad; Firouzbakht, Arash; Yuan, Jiann Shiun |
author_facet | Keshavarzi Arshadi, Arash; Salem, Milad; Firouzbakht, Arash; Yuan, Jiann Shiun |
author_sort | Keshavarzi Arshadi, Arash |
collection | PubMed |
description | Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, bringing both the ability to learn abstract features and the ability to discover complex molecular patterns from molecular data. Since biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing repositories such as PubChem and ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions that can hinder their use by the machine learning community. In this work, a comprehensive disease and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-022-00590-y. |
format | Online Article Text |
id | pubmed-8899453 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-8899453 2022-03-07 MolData, a molecular benchmark for disease and target based machine learning Keshavarzi Arshadi, Arash; Salem, Milad; Firouzbakht, Arash; Yuan, Jiann Shiun J Cheminform Research Article Springer International Publishing 2022-03-07 /pmc/articles/PMC8899453/ /pubmed/35255958 http://dx.doi.org/10.1186/s13321-022-00590-y Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Keshavarzi Arshadi, Arash Salem, Milad Firouzbakht, Arash Yuan, Jiann Shiun MolData, a molecular benchmark for disease and target based machine learning |
title | MolData, a molecular benchmark for disease and target based machine learning |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8899453/ https://www.ncbi.nlm.nih.gov/pubmed/35255958 http://dx.doi.org/10.1186/s13321-022-00590-y |