Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor

BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking ge...

Bibliographic Details
Main authors: Bendell, Calem J; Liu, Shalon; Aumentado-Armstrong, Tristan; Istrate, Bogdan; Cernek, Paul T; Khan, Samuel; Picioreanu, Sergiu; Zhao, Michael; Murgita, Robert A
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021185/
https://www.ncbi.nlm.nih.gov/pubmed/24661439
http://dx.doi.org/10.1186/1471-2105-15-82
author Bendell, Calem J
Liu, Shalon
Aumentado-Armstrong, Tristan
Istrate, Bogdan
Cernek, Paul T
Khan, Samuel
Picioreanu, Sergiu
Zhao, Michael
Murgita, Robert A
collection PubMed
description BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both of the binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine-learning-based PPI predictors is due, in order to highlight potential areas for improvement. RESULTS: The presence of unknown interaction sites as a result of limited knowledge about protein interactions in the testing set dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. CONCLUSION: Current methods of evaluating machine-learning-based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general.
format Online
Article
Text
id pubmed-4021185
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4021185 2014-05-28 Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor Bendell, Calem J Liu, Shalon Aumentado-Armstrong, Tristan Istrate, Bogdan Cernek, Paul T Khan, Samuel Picioreanu, Sergiu Zhao, Michael Murgita, Robert A BMC Bioinformatics Research Article BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both of the binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine-learning-based PPI predictors is due, in order to highlight potential areas for improvement. RESULTS: The presence of unknown interaction sites as a result of limited knowledge about protein interactions in the testing set dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. CONCLUSION: Current methods of evaluating machine-learning-based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general. BioMed Central 2014-03-24 /pmc/articles/PMC4021185/ /pubmed/24661439 http://dx.doi.org/10.1186/1471-2105-15-82 Text en Copyright © 2014 Bendell et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021185/
https://www.ncbi.nlm.nih.gov/pubmed/24661439
http://dx.doi.org/10.1186/1471-2105-15-82