
Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences

BACKGROUND: Ubiquitination is an important protein post-translational modification that has been widely investigated. Different experimental and computational methods have been developed to identify the ubiquitination sites in protein sequences....


Bibliographic Details
Main Authors: Cai, Binghuang; Jiang, Xia
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4778322/
https://www.ncbi.nlm.nih.gov/pubmed/26940649
http://dx.doi.org/10.1186/s12859-016-0959-z
author Cai, Binghuang
Jiang, Xia
collection PubMed
description BACKGROUND: Ubiquitination is an important protein post-translational modification that has been widely investigated. Different experimental and computational methods have been developed to identify the ubiquitination sites in protein sequences. This paper explores machine learning methods for predicting ubiquitination sites from the physicochemical properties (PCPs) of the amino acids in protein sequences.
RESULTS: We first establish six different ubiquitination data sets, whose records contain both ubiquitination sites and non-ubiquitination sites in varying numbers of protein sequence segments. To establish these data sets, protein sequence segments are extracted from the original protein sequences used in four published papers on ubiquitination, and 531 PCP features are calculated for each extracted segment by averaging, over all amino acids in the segment, the PCP values from AAindex (Amino Acid index database). Various machine learning methods, including four Bayesian network methods (i.e., Naïve Bayes (NB), Feature Selection NB (FSNB), Model Averaged NB (MANB), and Efficient Bayesian Multivariate Classifier (EBMC)) and three regression methods (i.e., Support Vector Machine (SVM), Logistic Regression (LR), and Least Absolute Shrinkage and Selection Operator (LASSO)), are then applied to the six established segment-PCP data sets. Five-fold cross-validation and the Area Under the Receiver Operating Characteristic Curve (AUROC) are used to evaluate the ubiquitination prediction performance of each method. Results demonstrate that the PCP data of protein sequences contain information that machine learning methods can mine for ubiquitination site prediction. The comparative results show that EBMC, SVM and LR perform better than the other methods, and that EBMC is the only method that achieves an AUROC greater than or equal to 0.6 on all six established data sets. Results also show that EBMC tends to perform better on larger data sets.
CONCLUSIONS: Machine learning methods have been employed for ubiquitination site prediction based on the physicochemical properties of amino acids in protein sequences. Results demonstrate the effectiveness of using machine learning methodology to mine information from PCP data concerning protein sequences, as well as the superiority of EBMC, SVM and LR (especially EBMC) for ubiquitination prediction compared to the other methods.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0959-z) contains supplementary material, which is available to authorized users.
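The segment-level feature construction and the evaluation protocol described in the RESULTS section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the property table `TOY_PCP`, its values, and all function names are hypothetical placeholders (the paper uses 531 properties from AAindex), the fold assignment is a simple round-robin split, and the AUROC is computed from its standard pairwise-ranking definition.

```python
# Hypothetical toy property table: property name -> {residue: value}.
# The values are made up for illustration and are NOT taken from AAindex.
TOY_PCP = {
    "prop_a": {"A": 1.0, "K": -3.0, "L": 4.0, "G": 0.0},
    "prop_b": {"A": 0.0, "K": 1.0, "L": 0.0, "G": 0.0},
}


def segment_features(segment, pcp_table):
    """Return one feature per property: the mean value over all residues."""
    features = []
    for values in pcp_table.values():
        per_residue = [values[aa] for aa in segment]
        features.append(sum(per_residue) / len(per_residue))
    return features


def k_fold_indices(n_records, k=5):
    """Assign record indices to k folds in round-robin order (an assumption)."""
    return [list(range(fold, n_records, k)) for fold in range(k)]


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(segment_features("AKLG", TOY_PCP))
    print(k_fold_indices(10))
    print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In a full pipeline, each fold from `k_fold_indices` would serve once as the held-out test set while a classifier is trained on the remaining four, and `auroc` would be applied to the held-out predictions.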
format Online
Article
Text
id pubmed-4778322
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4778322 2016-03-05 Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences Cai, Binghuang Jiang, Xia BMC Bioinformatics Research Article BioMed Central 2016-03-03 /pmc/articles/PMC4778322/ /pubmed/26940649 http://dx.doi.org/10.1186/s12859-016-0959-z Text en © Cai and Jiang. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4778322/
https://www.ncbi.nlm.nih.gov/pubmed/26940649
http://dx.doi.org/10.1186/s12859-016-0959-z