
MolData, a molecular benchmark for disease and target based machine learning

Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, infusing both the capabilities of learning abstract features and discovering complex molecular patterns via learning from molecular data. Since biological and chemical knowledge are necess...


Bibliographic Details
Main Authors:	Keshavarzi Arshadi, Arash; Salem, Milad; Firouzbakht, Arash; Yuan, Jiann Shiun
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2022
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8899453/ https://www.ncbi.nlm.nih.gov/pubmed/35255958 http://dx.doi.org/10.1186/s13321-022-00590-y

_version_	1784663916836552704
author	Keshavarzi Arshadi, Arash Salem, Milad Firouzbakht, Arash Yuan, Jiann Shiun
author_facet	Keshavarzi Arshadi, Arash Salem, Milad Firouzbakht, Arash Yuan, Jiann Shiun
author_sort	Keshavarzi Arshadi, Arash
collection	PubMed
description	Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, infusing both the capabilities of learning abstract features and discovering complex molecular patterns via learning from molecular data. Since biological and chemical knowledge are necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. The existing depositories such as PubChem or ChEMBL offer the screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions which can hinder their usage by the machine learning community. In this work, a comprehensive disease and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-022-00590-y.
format	Online Article Text
id	pubmed-8899453
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-8899453 2022-03-07 MolData, a molecular benchmark for disease and target based machine learning Keshavarzi Arshadi, Arash Salem, Milad Firouzbakht, Arash Yuan, Jiann Shiun J Cheminform Research Article Springer International Publishing 2022-03-07 /pmc/articles/PMC8899453/ /pubmed/35255958 http://dx.doi.org/10.1186/s13321-022-00590-y Text en © The Author(s) 2022. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Article Keshavarzi Arshadi, Arash Salem, Milad Firouzbakht, Arash Yuan, Jiann Shiun MolData, a molecular benchmark for disease and target based machine learning
title	MolData, a molecular benchmark for disease and target based machine learning
title_full	MolData, a molecular benchmark for disease and target based machine learning
title_fullStr	MolData, a molecular benchmark for disease and target based machine learning
title_full_unstemmed	MolData, a molecular benchmark for disease and target based machine learning
title_short	MolData, a molecular benchmark for disease and target based machine learning
title_sort	moldata, a molecular benchmark for disease and target based machine learning
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8899453/ https://www.ncbi.nlm.nih.gov/pubmed/35255958 http://dx.doi.org/10.1186/s13321-022-00590-y
work_keys_str_mv	AT keshavarziarshadiarash moldataamolecularbenchmarkfordiseaseandtargetbasedmachinelearning AT salemmilad moldataamolecularbenchmarkfordiseaseandtargetbasedmachinelearning AT firouzbakhtarash moldataamolecularbenchmarkfordiseaseandtargetbasedmachinelearning AT yuanjiannshiun moldataamolecularbenchmarkfordiseaseandtargetbasedmachinelearning

MolData, a molecular benchmark for disease and target based machine learning

Similar Items