
Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods

Pupylation is an important posttranslational modification in proteins and plays a key role in the cell function of microorganisms; an accurate prediction of pupylation proteins and specified sites is of great significance for the study of basic biological processes and development of related drugs s...


Bibliographic Details
Main Authors: Qiu, Wang-Ren, Guan, Meng-Yue, Wang, Qian-Kun, Lou, Li-Liang, Xiao, Xuan
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9088680/
https://www.ncbi.nlm.nih.gov/pubmed/35557849
http://dx.doi.org/10.3389/fendo.2022.849549
_version_ 1784704361939599360
author Qiu, Wang-Ren
Guan, Meng-Yue
Wang, Qian-Kun
Lou, Li-Liang
Xiao, Xuan
author_facet Qiu, Wang-Ren
Guan, Meng-Yue
Wang, Qian-Kun
Lou, Li-Liang
Xiao, Xuan
author_sort Qiu, Wang-Ren
collection PubMed
description Pupylation is an important posttranslational modification of proteins and plays a key role in the cell function of microorganisms. Accurate prediction of pupylation proteins and their specific sites is of great significance for the study of basic biological processes and the development of related drugs, since it would greatly reduce experimental costs and improve work efficiency. In this work, we first constructed a model for identifying pupylation proteins. To improve the pupylation protein prediction model, the KNN scoring matrix model based on functional domain GO annotation and the Word Embedding model were used to extract features, and Random Under-sampling (RUS) and the Synthetic Minority Over-sampling Technique (SMOTE) were applied to balance the dataset. Finally, the balanced datasets were input into Extreme Gradient Boosting (XGBoost). Under 10-fold cross-validation, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) are 95.23%, 0.8100, and 0.9864, respectively. For the pupylation site prediction model, six feature encodings (i.e., TPC, AAI, One-hot, PseAAC, CKSAAP, and Word Embedding) were used to extract protein sequence features, and the chi-square test was employed for feature selection. Rigorous 10-fold cross-validation indicated that the accuracies are very high and that the model outperforms its existing counterparts. Finally, for the convenience of researchers, PUP-PS-Fuse has been established at https://bioinfo.jcu.edu.cn/PUP-PS-Fuse, with http://121.36.221.79/PUP-PS-Fuse/ as a backup.
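Note on the reported metrics: the description quotes ACC and MCC without defining them. The sketch below uses the standard textbook definitions computed from confusion-matrix counts. It is not taken from the authors' code, and the counts in the usage lines are hypothetical values chosen only for illustration; they are not results from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only (not from the paper).
tp, tn, fp, fn = 40, 45, 5, 10
print(f"ACC = {accuracy(tp, tn, fp, fn):.4f}")
print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the positive and negative classes differ in size, as with pupylated versus non-pupylated proteins.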
format Online
Article
Text
id pubmed-9088680
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9088680 2022-05-11 Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods Qiu, Wang-Ren Guan, Meng-Yue Wang, Qian-Kun Lou, Li-Liang Xiao, Xuan Front Endocrinol (Lausanne) Endocrinology Pupylation is an important posttranslational modification of proteins and plays a key role in the cell function of microorganisms. Accurate prediction of pupylation proteins and their specific sites is of great significance for the study of basic biological processes and the development of related drugs, since it would greatly reduce experimental costs and improve work efficiency. In this work, we first constructed a model for identifying pupylation proteins. To improve the pupylation protein prediction model, the KNN scoring matrix model based on functional domain GO annotation and the Word Embedding model were used to extract features, and Random Under-sampling (RUS) and the Synthetic Minority Over-sampling Technique (SMOTE) were applied to balance the dataset. Finally, the balanced datasets were input into Extreme Gradient Boosting (XGBoost). Under 10-fold cross-validation, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) are 95.23%, 0.8100, and 0.9864, respectively. For the pupylation site prediction model, six feature encodings (i.e., TPC, AAI, One-hot, PseAAC, CKSAAP, and Word Embedding) were used to extract protein sequence features, and the chi-square test was employed for feature selection. Rigorous 10-fold cross-validation indicated that the accuracies are very high and that the model outperforms its existing counterparts. Finally, for the convenience of researchers, PUP-PS-Fuse has been established at https://bioinfo.jcu.edu.cn/PUP-PS-Fuse, with http://121.36.221.79/PUP-PS-Fuse/ as a backup. Frontiers Media S.A. 2022-04-26 /pmc/articles/PMC9088680/ /pubmed/35557849 http://dx.doi.org/10.3389/fendo.2022.849549 Text en Copyright © 2022 Qiu, Guan, Wang, Lou and Xiao https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Endocrinology
Qiu, Wang-Ren
Guan, Meng-Yue
Wang, Qian-Kun
Lou, Li-Liang
Xiao, Xuan
Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods
title Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods
title_full Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods
title_fullStr Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods
title_full_unstemmed Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods
title_short Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods
title_sort identifying pupylation proteins and sites by incorporating multiple methods
topic Endocrinology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9088680/
https://www.ncbi.nlm.nih.gov/pubmed/35557849
http://dx.doi.org/10.3389/fendo.2022.849549
work_keys_str_mv AT qiuwangren identifyingpupylationproteinsandsitesbyincorporatingmultiplemethods
AT guanmengyue identifyingpupylationproteinsandsitesbyincorporatingmultiplemethods
AT wangqiankun identifyingpupylationproteinsandsitesbyincorporatingmultiplemethods
AT louliliang identifyingpupylationproteinsandsitesbyincorporatingmultiplemethods
AT xiaoxuan identifyingpupylationproteinsandsitesbyincorporatingmultiplemethods