The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS
BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the av...
Main authors: | Tetko, Igor V.; Lowe, Daniel M.; Williams, Antony J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2016 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724158/ https://www.ncbi.nlm.nih.gov/pubmed/26807157 http://dx.doi.org/10.1186/s13321-016-0113-y |
_version_ | 1782411543613276160 |
---|---|
author | Tetko, Igor V.; Lowe, Daniel M.; Williams, Antony J. |
author_facet | Tetko, Igor V.; Lowe, Daniel M.; Williams, Antony J. |
author_sort | Tetko, Igor V. |
collection | PubMed |
description | BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the availability of high-quality MP data as well as accurate chemical structure representations in order to develop models. Currently available datasets for MP prediction have been limited to around 50k molecules, while much more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in an appropriate form, could potentially be used to develop predictive models. RESULTS: We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (http://ochem.eu). A number of technical challenges were solved in order to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task with 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to models based on the highly curated data used in previous publications. Data for chemicals that decomposed rather than melted were separated from data for compounds that underwent a normal melting transition, and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Finally, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed. CONCLUSIONS: We have shown that automated tools for the analysis of chemical information have reached a mature stage, allowing for the extraction and collection of high-quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at http://ochem.eu/article/99826. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-016-0113-y) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4724158 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-4724158 2016-01-24 The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS Tetko, Igor V.; Lowe, Daniel M.; Williams, Antony J. J Cheminform Research Article Springer International Publishing 2016-01-22 /pmc/articles/PMC4724158/ /pubmed/26807157 http://dx.doi.org/10.1186/s13321-016-0113-y Text en © Tetko et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS |
title_sort | development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from patents |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724158/ https://www.ncbi.nlm.nih.gov/pubmed/26807157 http://dx.doi.org/10.1186/s13321-016-0113-y |
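
The abstract's figure of ">200,000,000,000 entries" can be checked against its other rounded numbers: roughly 300,000 data points by roughly 700,000 descriptors. The sketch below is only a back-of-the-envelope estimate, not the authors' method. The 4-byte value size, the 4-byte index size, and the non-zero fraction are illustrative assumptions that do not come from the record.

```python
# Back-of-the-envelope sizing of the descriptor matrix mentioned in the abstract.
# N_COMPOUNDS and N_DESCRIPTORS are rounded figures from the abstract.
# BYTES_PER_VALUE, INDEX_BYTES and NONZERO_PER_MILLION are assumptions
# chosen purely for illustration.
N_COMPOUNDS = 300_000          # "almost 300,000 data points"
N_DESCRIPTORS = 700_000        # "more than 700,000 descriptors"
BYTES_PER_VALUE = 4            # assumption: 32-bit floating-point values
INDEX_BYTES = 4                # assumption: 32-bit column indices and row pointers
NONZERO_PER_MILLION = 1_000    # assumption: 0.1% of entries are non-zero


def dense_entries(rows: int, cols: int) -> int:
    """Number of cells in a fully materialised rows x cols matrix."""
    return rows * cols


def dense_bytes(rows: int, cols: int) -> int:
    """Memory needed to store every cell explicitly."""
    return dense_entries(rows, cols) * BYTES_PER_VALUE


def sparse_bytes(rows: int, cols: int, nonzero_per_million: int) -> int:
    """Rough compressed-sparse-row estimate.

    Each non-zero entry stores one value and one column index;
    one row pointer is stored per row, plus one terminator.
    """
    nonzeros = rows * cols * nonzero_per_million // 1_000_000
    return nonzeros * (BYTES_PER_VALUE + INDEX_BYTES) + (rows + 1) * INDEX_BYTES


if __name__ == "__main__":
    entries = dense_entries(N_COMPOUNDS, N_DESCRIPTORS)
    print(f"entries:      {entries:,}")
    print(f"dense bytes:  {dense_bytes(N_COMPOUNDS, N_DESCRIPTORS):,}")
    print(f"sparse bytes: "
          f"{sparse_bytes(N_COMPOUNDS, N_DESCRIPTORS, NONZERO_PER_MILLION):,}")
```

With these rounded inputs the product is 210,000,000,000 cells, consistent with the ">200,000,000,000 entries" stated in the abstract. The gap between the dense and sparse estimates depends entirely on the assumed non-zero fraction.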