The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction

It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely u...

Bibliographic Details
Main authors:	Li, Hongjian; Peng, Jiangjun; Leung, Yee; Leung, Kwong-Sak; Wong, Man-Hon; Lu, Gang; Ballester, Pedro J.
Format:	Online Article Text
Language:	English
Published:	MDPI 2018
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5871981/ https://www.ncbi.nlm.nih.gov/pubmed/29538331 http://dx.doi.org/10.3390/biom8010012

author	Li, Hongjian; Peng, Jiangjun; Leung, Yee; Leung, Kwong-Sak; Wong, Man-Hon; Lu, Gang; Ballester, Pedro J.
author_sort	Li, Hongjian
collection	PubMed
description	It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to enlarge in the future.
format	Online Article Text
id	pubmed-5871981
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-5871981 2018-03-30 Biomolecules Article MDPI 2018-03-14 /pmc/articles/PMC5871981/ /pubmed/29538331 http://dx.doi.org/10.3390/biom8010012 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction
topic	Article
