
Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study

BACKGROUND: State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and thus vital for discriminating between active and inactive compounds. Despite intensive research over the year...


Bibliographic Details
Main authors:	Li, Hongjian, Leung, Kwong-Sak, Wong, Man-Hon, Ballester, Pedro J
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4153907/ https://www.ncbi.nlm.nih.gov/pubmed/25159129 http://dx.doi.org/10.1186/1471-2105-15-291

_version_	1782333350554370048
author	Li, Hongjian Leung, Kwong-Sak Wong, Man-Hon Ballester, Pedro J
collection	PubMed
description	BACKGROUND: State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and thus vital for discriminating between active and inactive compounds. Despite intensive research over the years, classical scoring functions have reached a plateau in their predictive performance. These assume a predetermined additive functional form for some sophisticated numerical features, and use standard multivariate linear regression (MLR) on experimental data to derive the coefficients. RESULTS: In this study we show that such a simple functional form is detrimental for the prediction performance of a scoring function, and replacing linear regression by machine learning techniques like random forest (RF) can improve prediction performance. We investigate the conditions of applying RF under various contexts and find that given sufficient training samples RF manages to comprehensively capture the non-linearity between structural features and measured binding affinities. Incorporating more structural features and training with more samples can both boost RF performance. In addition, we analyze the importance of structural features to binding affinity prediction using the RF variable importance tool. Lastly, we use Cyscore, a top performing empirical scoring function, as a baseline for comparison study. CONCLUSIONS: Machine-learning scoring functions are fundamentally different from classical scoring functions because the former circumvents the fixed functional form relating structural features with binding affinities. RF, but not MLR, can effectively exploit more structural features and more training samples, leading to higher prediction performance. The future availability of more X-ray crystal structures will further widen the performance gap between RF-based and MLR-based scoring functions. This further stresses the importance of substituting RF for MLR in scoring function development. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2105-15-291) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4153907
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4153907 2014-09-05 BMC Bioinformatics Research Article BioMed Central 2014-08-27 /pmc/articles/PMC4153907/ /pubmed/25159129 http://dx.doi.org/10.1186/1471-2105-15-291 Text en © Li et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4153907/ https://www.ncbi.nlm.nih.gov/pubmed/25159129 http://dx.doi.org/10.1186/1471-2105-15-291

