
Practical guidelines for the use of gradient boosting for molecular property prediction

Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure–activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.


Bibliographic Details
Main Authors: Boldini, Davide, Grisoni, Francesca, Kuhn, Daniel, Friedrich, Lukas, Sieber, Stephan A.
Format: Online Article Text
Language: English
Published: Springer International Publishing 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10464382/
https://www.ncbi.nlm.nih.gov/pubmed/37641120
http://dx.doi.org/10.1186/s13321-023-00743-7
_version_ 1785098457663406080
author Boldini, Davide
Grisoni, Francesca
Kuhn, Daniel
Friedrich, Lukas
Sieber, Stephan A.
author_facet Boldini, Davide
Grisoni, Francesca
Kuhn, Daniel
Friedrich, Lukas
Sieber, Stephan A.
author_sort Boldini, Davide
collection PubMed
description Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure–activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications. GRAPHICAL ABSTRACT: [Image: see text] SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13321-023-00743-7.
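The abstract describes training, optimizing and evaluating gradient boosting models for QSAR. As a minimal illustration of the underlying technique, the sketch below implements squared-loss gradient boosting on depth-1 trees (decision stumps) in plain Python on toy data. It is illustrative only and is not the authors' pipeline: the study's experiments use the XGBoost, LightGBM and CatBoost libraries on molecular descriptors, with extensive hyperparameter optimization.

```python
# Minimal sketch of gradient boosting for regression: each round fits a
# decision stump to the current residuals (the negative gradient of squared
# loss) and adds it to the ensemble with a shrinkage factor `lr`.
# Toy data only; real QSAR work would use XGBoost/LightGBM/CatBoost.

def fit_stump(X, residuals):
    """Return the (feature, threshold, left_mean, right_mean) split that
    minimizes squared error on the residuals, or None if no split exists."""
    n = len(X)
    best, best_sse = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [residuals[i] for i in range(n) if X[i][j] <= thr]
            right = [residuals[i] for i in range(n) if X[i][j] > thr]
            if not left or not right:
                continue
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if sse < best_sse:
                best, best_sse = (j, thr, lm, rm), sse
    return best

def predict_stump(stump, x):
    j, thr, lm, rm = stump
    return lm if x[j] <= thr else rm

def fit_gbm(X, y, n_rounds=50, lr=0.5):
    """Boost stumps against the residuals; start from the mean prediction."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + lr * predict_stump(stump, x) for p, x in zip(preds, X)]
    return base, lr, stumps

def predict_gbm(model, x):
    base, lr, stumps = model
    return base + sum(lr * predict_stump(s, x) for s in stumps)

# Toy "descriptor" matrix and an additive target, y = x0 + x1.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0]
model = fit_gbm(X, y)
```

The shrinkage factor `lr` and the number of rounds `n_rounds` are the two hyperparameters even this toy version exposes; the libraries compared in the article add many more (tree depth, subsampling, regularization strength), which is why the abstract stresses tuning as many of them as possible.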
format Online
Article
Text
id pubmed-10464382
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-10464382 2023-08-30 Practical guidelines for the use of gradient boosting for molecular property prediction Boldini, Davide Grisoni, Francesca Kuhn, Daniel Friedrich, Lukas Sieber, Stephan A. J Cheminform Research
Springer International Publishing 2023-08-28 /pmc/articles/PMC10464382/ /pubmed/37641120 http://dx.doi.org/10.1186/s13321-023-00743-7 Text en © The Author(s) 2023. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Boldini, Davide
Grisoni, Francesca
Kuhn, Daniel
Friedrich, Lukas
Sieber, Stephan A.
Practical guidelines for the use of gradient boosting for molecular property prediction
title Practical guidelines for the use of gradient boosting for molecular property prediction
title_full Practical guidelines for the use of gradient boosting for molecular property prediction
title_fullStr Practical guidelines for the use of gradient boosting for molecular property prediction
title_full_unstemmed Practical guidelines for the use of gradient boosting for molecular property prediction
title_short Practical guidelines for the use of gradient boosting for molecular property prediction
title_sort practical guidelines for the use of gradient boosting for molecular property prediction
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10464382/
https://www.ncbi.nlm.nih.gov/pubmed/37641120
http://dx.doi.org/10.1186/s13321-023-00743-7
work_keys_str_mv AT boldinidavide practicalguidelinesfortheuseofgradientboostingformolecularpropertyprediction
AT grisonifrancesca practicalguidelinesfortheuseofgradientboostingformolecularpropertyprediction
AT kuhndaniel practicalguidelinesfortheuseofgradientboostingformolecularpropertyprediction
AT friedrichlukas practicalguidelinesfortheuseofgradientboostingformolecularpropertyprediction
AT sieberstephana practicalguidelinesfortheuseofgradientboostingformolecularpropertyprediction