Dataset’s chemical diversity limits the generalizability of machine learning predictions
The QM9 dataset has become the golden standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functio...
Main authors: | Glavatskikh, Marta; Leguy, Jules; Hunault, Gilles; Cauchy, Thomas; Da Mota, Benoit |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6852905/ https://www.ncbi.nlm.nih.gov/pubmed/33430991 http://dx.doi.org/10.1186/s13321-019-0391-2 |
_version_ | 1783469941658222592 |
---|---|
author | Glavatskikh, Marta Leguy, Jules Hunault, Gilles Cauchy, Thomas Da Mota, Benoit |
author_facet | Glavatskikh, Marta Leguy, Jules Hunault, Gilles Cauchy, Thomas Da Mota, Benoit |
author_sort | Glavatskikh, Marta |
collection | PubMed |
description | The QM9 dataset has become the golden standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9 equivalent dataset (only H, C, N, O and F and up to 9 “heavy” atoms) of the PubChemQC project is presented in this article. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset. |
format | Online Article Text |
id | pubmed-6852905 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-6852905 2019-11-21 Dataset’s chemical diversity limits the generalizability of machine learning predictions Glavatskikh, Marta Leguy, Jules Hunault, Gilles Cauchy, Thomas Da Mota, Benoit J Cheminform Research Article The QM9 dataset has become the golden standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9 equivalent dataset (only H, C, N, O and F and up to 9 “heavy” atoms) of the PubChemQC project is presented in this article. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset. Springer International Publishing 2019-11-12 /pmc/articles/PMC6852905/ /pubmed/33430991 http://dx.doi.org/10.1186/s13321-019-0391-2 Text en © The Author(s) 2019. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Glavatskikh, Marta Leguy, Jules Hunault, Gilles Cauchy, Thomas Da Mota, Benoit Dataset’s chemical diversity limits the generalizability of machine learning predictions |
title | Dataset’s chemical diversity limits the generalizability of machine learning predictions |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6852905/ https://www.ncbi.nlm.nih.gov/pubmed/33430991 http://dx.doi.org/10.1186/s13321-019-0391-2 |