GEOM, energy-annotated molecular conformations for property prediction and molecular generation

Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule...

Bibliographic Details
Main authors: Axelrod, Simon; Gómez-Bombarelli, Rafael
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects: Data Descriptor
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9023519/
https://www.ncbi.nlm.nih.gov/pubmed/35449137
http://dx.doi.org/10.1038/s41597-022-01288-4
author Axelrod, Simon
Gómez-Bombarelli, Rafael
collection PubMed
description Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
format Online
Article
Text
id pubmed-9023519
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9023519 2022-04-28 GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Axelrod, Simon; Gómez-Bombarelli, Rafael. Sci Data, Data Descriptor. Nature Publishing Group UK 2022-04-21 /pmc/articles/PMC9023519/ /pubmed/35449137 http://dx.doi.org/10.1038/s41597-022-01288-4 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title GEOM, energy-annotated molecular conformations for property prediction and molecular generation
topic Data Descriptor
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9023519/
https://www.ncbi.nlm.nih.gov/pubmed/35449137
http://dx.doi.org/10.1038/s41597-022-01288-4