Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor

BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking ge...

Bibliographic Details
Main Authors:	Bendell, Calem J; Liu, Shalon; Aumentado-Armstrong, Tristan; Istrate, Bogdan; Cernek, Paul T; Khan, Samuel; Picioreanu, Sergiu; Zhao, Michael; Murgita, Robert A
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021185/ https://www.ncbi.nlm.nih.gov/pubmed/24661439 http://dx.doi.org/10.1186/1471-2105-15-82

collection	PubMed
description	BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine-learning-based PPI predictors is due, in order to highlight potential areas for improvement. RESULTS: The presence of unknown interaction sites as a result of limited knowledge about protein interactions in the testing set dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. CONCLUSION: Current methods of evaluating machine-learning-based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general.
id	pubmed-4021185
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-4021185 2014-05-28 BMC Bioinformatics Research Article BioMed Central 2014-03-24 /pmc/articles/PMC4021185/ /pubmed/24661439 http://dx.doi.org/10.1186/1471-2105-15-82 Text en Copyright © 2014 Bendell et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
