
The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS

BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the av...


Bibliographic Details
Main authors: Tetko, Igor V.; Lowe, Daniel M.; Williams, Antony J.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724158/
https://www.ncbi.nlm.nih.gov/pubmed/26807157
http://dx.doi.org/10.1186/s13321-016-0113-y
author Tetko, Igor V.
Lowe, Daniel M.
Williams, Antony J.
collection PubMed
description BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the availability of high-quality MP data, as well as accurate chemical structure representations, in order to develop models. Currently available datasets for MP prediction have been limited to around 50k molecules, while far more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in an appropriate form, could potentially be used to develop predictive models.
RESULTS: We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (http://ochem.eu). A number of technical challenges were solved to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task with 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to those based on the highly curated data used in previous publications. Data for chemicals that decomposed rather than melted were separated from data for compounds that underwent a normal melting transition, and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Finally, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed.
CONCLUSIONS: We have shown that automated tools for the analysis of chemical information have reached a mature stage, allowing for the extraction and collection of high-quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at http://ochem.eu/article/99826.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-016-0113-y) contains supplementary material, which is available to authorized users.
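The matrix size quoted in the RESULTS above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes the matrix is laid out as compounds × descriptors, rounds the record's "almost 300,000" and "more than 700,000" to those figures, and assumes 4-byte values for the dense-storage estimate. None of these details are stated in the record, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the ">200,000,000,000 entries" figure.
# Assumption: rows are compounds and columns are descriptors.
N_COMPOUNDS = 300_000    # "almost 300,000 data points" (rounded up)
N_DESCRIPTORS = 700_000  # "more than 700,000 descriptors" (rounded down)


def matrix_entries(rows: int, cols: int) -> int:
    """Number of cells in a rows x cols matrix."""
    return rows * cols


def dense_size_gb(rows: int, cols: int, bytes_per_value: int = 4) -> float:
    """Memory needed to store every cell explicitly, in gigabytes (1e9 bytes)."""
    return matrix_entries(rows, cols) * bytes_per_value / 1e9


entries = matrix_entries(N_COMPOUNDS, N_DESCRIPTORS)
print(f"{entries:.2e} entries")
print(f"{dense_size_gb(N_COMPOUNDS, N_DESCRIPTORS):.0f} GB if stored densely")
```

Under these assumptions the product is on the order of 2 × 10^11, consistent with the quoted figure, and dense storage would need hundreds of gigabytes, which suggests why a sparse representation was needed.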
format Online
Article
Text
id pubmed-4724158
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-4724158 2016-01-24 J Cheminform Research Article Springer International Publishing 2016-01-22 /pmc/articles/PMC4724158/ /pubmed/26807157 http://dx.doi.org/10.1186/s13321-016-0113-y Text en © Tetko et al. 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS
topic Research Article