Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor
BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking ge...
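Main authors: | Bendell, Calem J; Liu, Shalon; Aumentado-Armstrong, Tristan; Istrate, Bogdan; Cernek, Paul T; Khan, Samuel; Picioreanu, Sergiu; Zhao, Michael; Murgita, Robert A |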
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central, 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021185/ https://www.ncbi.nlm.nih.gov/pubmed/24661439 http://dx.doi.org/10.1186/1471-2105-15-82 |
_version_ | 1782316189093986304 |
---|---|
author | Bendell, Calem J; Liu, Shalon; Aumentado-Armstrong, Tristan; Istrate, Bogdan; Cernek, Paul T; Khan, Samuel; Picioreanu, Sergiu; Zhao, Michael; Murgita, Robert A |
author_sort | Bendell, Calem J |
collection | PubMed |
description | BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine-learning-based PPI predictors is due, in order to highlight potential areas for improvement. RESULTS: The presence of unknown interaction sites as a result of limited knowledge about protein interactions in the testing set dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. CONCLUSION: Current methods of evaluating machine-learning-based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well-studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general. |
format | Online Article Text |
id | pubmed-4021185 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4021185 2014-05-28 BMC Bioinformatics Research Article BioMed Central 2014-03-24 /pmc/articles/PMC4021185/ /pubmed/24661439 http://dx.doi.org/10.1186/1471-2105-15-82 Text en Copyright © 2014 Bendell et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4021185/ https://www.ncbi.nlm.nih.gov/pubmed/24661439 http://dx.doi.org/10.1186/1471-2105-15-82 |