
The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS

BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the av...


Bibliographic Details
Main Authors:	Tetko, Igor V.; Lowe, Daniel M.; Williams, Antony J.
Format:	Online Article Text
Language:	English
Published:	Springer International Publishing 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724158/ https://www.ncbi.nlm.nih.gov/pubmed/26807157 http://dx.doi.org/10.1186/s13321-016-0113-y

author	Tetko, Igor V.; Lowe, Daniel M.; Williams, Antony J.
author_sort	Tetko, Igor V.
collection	PubMed
description	BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the availability of high-quality MP data as well as accurate chemical structure representations in order to develop models. Currently available datasets for MP prediction have been limited to around 50k molecules, while far more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in an appropriate form, could potentially be used to develop predictive models. RESULTS: We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (http://ochem.eu). A number of technical challenges were solved simultaneously to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task with 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to those developed using the highly curated data used in previous publications. Data for chemicals that decomposed rather than melted were separated from data for compounds that underwent a normal melting transition, and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Important structural features related to the pyrolysis of chemicals were also identified, and a model to predict whether a compound will decompose instead of melting was developed. CONCLUSIONS: We have shown that automated tools for the analysis of chemical information have reached a mature stage, allowing for the extraction and collection of high-quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at http://ochem.eu/article/99826. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s13321-016-0113-y) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4724158
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Springer International Publishing
record_format	MEDLINE/PubMed
spelling	pubmed-4724158 2016-01-24 J Cheminform Research Article Springer International Publishing 2016-01-22 /pmc/articles/PMC4724158/ /pubmed/26807157 http://dx.doi.org/10.1186/s13321-016-0113-y Text en © Tetko et al. 2016 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS
topic	Research Article


Similar Items