iPPBS-Opt: A Sequence-Based Ensemble Classifier for Identifying Protein-Protein Binding Sites by Optimizing Imbalanced Training Datasets

Knowledge of protein-protein interactions and their binding sites is indispensable for in-depth understanding of the networks in living cells. With the avalanche of protein sequences generated in the postgenomic age, it is critical to develop computational methods for identifying in a timely fashion...

Bibliographic Details
Main authors: Jia, Jianhua, Liu, Zi, Xiao, Xuan, Liu, Bingxiang, Chou, Kuo-Chen
Format: Online Article Text
Language: English
Published: MDPI 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6274413/
https://www.ncbi.nlm.nih.gov/pubmed/26797600
http://dx.doi.org/10.3390/molecules21010095
author Jia, Jianhua
Liu, Zi
Xiao, Xuan
Liu, Bingxiang
Chou, Kuo-Chen
collection PubMed
description Knowledge of protein-protein interactions and their binding sites is indispensable for in-depth understanding of the networks in living cells. With the avalanche of protein sequences generated in the postgenomic age, it is critical to develop computational methods for identifying protein-protein binding sites (PPBSs) in a timely fashion from sequence information alone, because the information obtained in this way can be used for both biomedical research and drug development. To address this challenge, we have proposed a new predictor, called iPPBS-Opt, in which we have used: (1) the K-Nearest Neighbors Cleaning (KNNC) and Inserting Hypothetical Training Samples (IHTS) treatments to optimize the training dataset; (2) the ensemble voting approach to select the most relevant features; and (3) the stationary wavelet transform to formulate the statistical samples. Cross-validation tests against experimentally confirmed results have demonstrated that the new predictor is very promising, implying that the aforementioned practices are indeed very effective. In particular, the approach of using wavelets to express protein/peptide sequences might be the key to grasping the problem’s essence, fully consistent with the findings that many important biological functions of proteins can be elucidated with their low-frequency internal motions. To maximize convenience for most experimental scientists, we have provided a step-by-step guide on how to use the predictor’s web server (http://www.jci-bioinfo.cn/iPPBS-Opt) to get the desired results without the need to go through the complicated mathematical equations involved.
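The two dataset treatments named in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that KNNC drops a negative (majority-class) sample when a positive sample appears among its k nearest neighbors, and that IHTS adds synthetic positives by interpolating between real ones. The function names, the value of k, and the toy feature vectors are all hypothetical.

```python
import math
import random


def knn_clean(negatives, positives, k=3):
    """Drop each negative that has a positive among its k nearest neighbors (assumed rule)."""
    kept = []
    for i, neg in enumerate(negatives):
        # Candidate neighbors: every other negative (label 0) and every positive (label 1).
        others = [(math.dist(neg, n), 0) for j, n in enumerate(negatives) if j != i]
        others += [(math.dist(neg, p), 1) for p in positives]
        nearest = sorted(others)[:k]
        if not any(label == 1 for _, label in nearest):
            kept.append(neg)
    return kept


def insert_hypothetical(positives, n_new, rng):
    """Create n_new synthetic positives by interpolating between random pairs of real ones."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(positives, 2)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy 2-D feature vectors (hypothetical data, for illustration only).
positives = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
negatives = [(0.15, 0.15), (5.0, 5.0), (5.2, 5.1), (4.9, 5.3), (5.1, 4.8), (5.3, 5.2)]

cleaned = knn_clean(negatives, positives, k=3)
extra = insert_hypothetical(positives, len(cleaned) - len(positives), random.Random(0))
balanced_positives = positives + extra
```

In this toy run, the one negative sitting inside the positive cluster is removed, and enough synthetic positives are added to match the number of remaining negatives.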
format Online
Article
Text
id pubmed-6274413
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6274413 2018-12-28 iPPBS-Opt: A Sequence-Based Ensemble Classifier for Identifying Protein-Protein Binding Sites by Optimizing Imbalanced Training Datasets Jia, Jianhua Liu, Zi Xiao, Xuan Liu, Bingxiang Chou, Kuo-Chen Molecules Article
MDPI 2016-01-19 /pmc/articles/PMC6274413/ /pubmed/26797600 http://dx.doi.org/10.3390/molecules21010095 Text en © 2016 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
topic Article