A big data approach to the ultra-fast prediction of DFT-calculated bond energies

BACKGROUND: The rapid access to intrinsic physicochemical properties of molecules is highly desired for large scale chemical data mining explorations such as mass spectrum prediction in metabolomics, toxicity risk assessment and drug discovery. Large volumes of data are being produced by quantum che...

Bibliographic Details
Main Authors: Qu, Xiaohui; Latino, Diogo ARS; Aires-de-Sousa, Joao
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3720218/
https://www.ncbi.nlm.nih.gov/pubmed/23849655
http://dx.doi.org/10.1186/1758-2946-5-34
_version_ 1782277940216594432
author Qu, Xiaohui
Latino, Diogo ARS
Aires-de-Sousa, Joao
author_facet Qu, Xiaohui
Latino, Diogo ARS
Aires-de-Sousa, Joao
author_sort Qu, Xiaohui
collection PubMed
description BACKGROUND: The rapid access to intrinsic physicochemical properties of molecules is highly desired for large scale chemical data mining explorations such as mass spectrum prediction in metabolomics, toxicity risk assessment and drug discovery. Large volumes of data are being produced by quantum chemistry calculations, which provide increasingly accurate estimations of several properties, e.g. by Density Functional Theory (DFT), but are still too computationally expensive for those large scale uses. This work explores the possibility of using large amounts of data generated by DFT methods for thousands of molecular structures, extracting relevant molecular properties and applying machine learning (ML) algorithms to learn from the data. Once trained, these ML models can be applied to new structures to produce ultra-fast predictions. An approach is presented for homolytic bond dissociation energy (BDE). RESULTS: Machine learning models were trained with a data set of >12,000 BDEs calculated by B3LYP/6-311++G(d,p)//DFTB. Descriptors were designed to encode atom types and connectivity in the 2D topological environment of the bonds. The best model, an Associative Neural Network (ASNN) based on 85 bond descriptors, was able to predict the BDE of 887 bonds in an independent test set (covering a range of 17.67–202.30 kcal/mol) with RMSD of 5.29 kcal/mol, mean absolute deviation of 3.35 kcal/mol, and R² = 0.953. The predictions were compared with semi-empirical PM6 calculations, and were found to be superior for all types of bonds in the data set, except for O-H, N-H, and N-N bonds. The B3LYP/6-311++G(d,p)//DFTB calculations can approach the higher-level calculations B3LYP/6-311++G(3df,2p)//B3LYP/6-31G(d,p) with an RMSD of 3.04 kcal/mol, which is less than the RMSD of ASNN (against both DFT methods). An experimental web service for on-line prediction of BDEs is available at http://joao.airesdesousa.com/bde.
CONCLUSION: Knowledge could be automatically extracted by machine learning techniques from a data set of calculated BDEs, providing ultra-fast access to accurate estimations of DFT-calculated BDEs. This demonstrates how to extract value from large volumes of data currently being produced by quantum chemistry calculations at an increasing speed mostly without human intervention. In this way, high-level theoretical quantum calculations can be used in large-scale applications that otherwise would not afford the intrinsic computational cost.
format Online
Article
Text
id pubmed-3720218
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-37202182013-07-26 A big data approach to the ultra-fast prediction of DFT-calculated bond energies Qu, Xiaohui Latino, Diogo ARS Aires-de-Sousa, Joao J Cheminform Research Article BACKGROUND: The rapid access to intrinsic physicochemical properties of molecules is highly desired for large scale chemical data mining explorations such as mass spectrum prediction in metabolomics, toxicity risk assessment and drug discovery. Large volumes of data are being produced by quantum chemistry calculations, which provide increasingly accurate estimations of several properties, e.g. by Density Functional Theory (DFT), but are still too computationally expensive for those large scale uses. This work explores the possibility of using large amounts of data generated by DFT methods for thousands of molecular structures, extracting relevant molecular properties and applying machine learning (ML) algorithms to learn from the data. Once trained, these ML models can be applied to new structures to produce ultra-fast predictions. An approach is presented for homolytic bond dissociation energy (BDE). RESULTS: Machine learning models were trained with a data set of >12,000 BDEs calculated by B3LYP/6-311++G(d,p)//DFTB. Descriptors were designed to encode atom types and connectivity in the 2D topological environment of the bonds. The best model, an Associative Neural Network (ASNN) based on 85 bond descriptors, was able to predict the BDE of 887 bonds in an independent test set (covering a range of 17.67–202.30 kcal/mol) with RMSD of 5.29 kcal/mol, mean absolute deviation of 3.35 kcal/mol, and R² = 0.953. The predictions were compared with semi-empirical PM6 calculations, and were found to be superior for all types of bonds in the data set, except for O-H, N-H, and N-N bonds.
The B3LYP/6-311++G(d,p)//DFTB calculations can approach the higher-level calculations B3LYP/6-311++G(3df,2p)//B3LYP/6-31G(d,p) with an RMSD of 3.04 kcal/mol, which is less than the RMSD of ASNN (against both DFT methods). An experimental web service for on-line prediction of BDEs is available at http://joao.airesdesousa.com/bde. CONCLUSION: Knowledge could be automatically extracted by machine learning techniques from a data set of calculated BDEs, providing ultra-fast access to accurate estimations of DFT-calculated BDEs. This demonstrates how to extract value from large volumes of data currently being produced by quantum chemistry calculations at an increasing speed mostly without human intervention. In this way, high-level theoretical quantum calculations can be used in large-scale applications that otherwise would not afford the intrinsic computational cost. BioMed Central 2013-07-12 /pmc/articles/PMC3720218/ /pubmed/23849655 http://dx.doi.org/10.1186/1758-2946-5-34 Text en Copyright © 2013 Qu et al.; licensee Chemistry Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Qu, Xiaohui
Latino, Diogo ARS
Aires-de-Sousa, Joao
A big data approach to the ultra-fast prediction of DFT-calculated bond energies
title A big data approach to the ultra-fast prediction of DFT-calculated bond energies
title_full A big data approach to the ultra-fast prediction of DFT-calculated bond energies
title_fullStr A big data approach to the ultra-fast prediction of DFT-calculated bond energies
title_full_unstemmed A big data approach to the ultra-fast prediction of DFT-calculated bond energies
title_short A big data approach to the ultra-fast prediction of DFT-calculated bond energies
title_sort big data approach to the ultra-fast prediction of dft-calculated bond energies
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3720218/
https://www.ncbi.nlm.nih.gov/pubmed/23849655
http://dx.doi.org/10.1186/1758-2946-5-34
work_keys_str_mv AT quxiaohui abigdataapproachtotheultrafastpredictionofdftcalculatedbondenergies
AT latinodiogoars abigdataapproachtotheultrafastpredictionofdftcalculatedbondenergies
AT airesdesousajoao abigdataapproachtotheultrafastpredictionofdftcalculatedbondenergies
AT quxiaohui bigdataapproachtotheultrafastpredictionofdftcalculatedbondenergies
AT latinodiogoars bigdataapproachtotheultrafastpredictionofdftcalculatedbondenergies
AT airesdesousajoao bigdataapproachtotheultrafastpredictionofdftcalculatedbondenergies