
Pushing the limits of solubility prediction via quality-oriented data selection

Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of d...


Bibliographic Details
Main authors: Sorkun, Murat Cihan; Koelman, J.M. Vianney A.; Er, Süleyman
Format: Online Article Text
Language: English
Published: Elsevier 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7788089/
https://www.ncbi.nlm.nih.gov/pubmed/33437941
http://dx.doi.org/10.1016/j.isci.2020.101961
_version_ 1783632960790986752
author Sorkun, Murat Cihan
Koelman, J.M. Vianney A.
Er, Süleyman
author_facet Sorkun, Murat Cihan
Koelman, J.M. Vianney A.
Er, Süleyman
author_sort Sorkun, Murat Cihan
collection PubMed
description Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of the size and the quality of data sets on the performances of the solubility prediction models are unraveled, and the concepts of actual and observed performances are introduced. In an effort to curtail the gap between actual and observed performances, a quality-oriented data selection method, which evaluates the quality of data and extracts the most accurate part of it through statistical validation, is designed. Applying this method on the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is achieved.
format Online
Article
Text
id pubmed-7788089
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-77880892021-01-11 Pushing the limits of solubility prediction via quality-oriented data selection Sorkun, Murat Cihan Koelman, J.M. Vianney A. Er, Süleyman iScience Article Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of the size and the quality of data sets on the performances of the solubility prediction models are unraveled, and the concepts of actual and observed performances are introduced. In an effort to curtail the gap between actual and observed performances, a quality-oriented data selection method, which evaluates the quality of data and extracts the most accurate part of it through statistical validation, is designed. Applying this method on the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is achieved. Elsevier 2020-12-17 /pmc/articles/PMC7788089/ /pubmed/33437941 http://dx.doi.org/10.1016/j.isci.2020.101961 Text en © 2020 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Sorkun, Murat Cihan
Koelman, J.M. Vianney A.
Er, Süleyman
Pushing the limits of solubility prediction via quality-oriented data selection
title Pushing the limits of solubility prediction via quality-oriented data selection
title_full Pushing the limits of solubility prediction via quality-oriented data selection
title_fullStr Pushing the limits of solubility prediction via quality-oriented data selection
title_full_unstemmed Pushing the limits of solubility prediction via quality-oriented data selection
title_short Pushing the limits of solubility prediction via quality-oriented data selection
title_sort pushing the limits of solubility prediction via quality-oriented data selection
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7788089/
https://www.ncbi.nlm.nih.gov/pubmed/33437941
http://dx.doi.org/10.1016/j.isci.2020.101961
work_keys_str_mv AT sorkunmuratcihan pushingthelimitsofsolubilitypredictionviaqualityorienteddataselection
AT koelmanjmvianneya pushingthelimitsofsolubilitypredictionviaqualityorienteddataselection
AT ersuleyman pushingthelimitsofsolubilitypredictionviaqualityorienteddataselection