
Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks

BACKGROUND: Protein post-translational modification (PTM) is key to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of in-depth study and anal...

Bibliographic Details
Main authors: Yang, Yingxi, Wang, Hui, Li, Wen, Wang, Xiaobo, Wei, Shizhao, Liu, Yulong, Xu, Yan
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8010967/
https://www.ncbi.nlm.nih.gov/pubmed/33789579
http://dx.doi.org/10.1186/s12859-021-04101-y
_version_ 1783673160080556032
author Yang, Yingxi
Wang, Hui
Li, Wen
Wang, Xiaobo
Wei, Shizhao
Liu, Yulong
Xu, Yan
author_sort Yang, Yingxi
collection PubMed
description BACKGROUND: Protein post-translational modification (PTM) is key to investigating the mechanisms of protein function. With the rapid development of proteomics technology, a large amount of protein sequence data has been generated, which highlights the importance of in-depth study and analysis of PTMs in proteins. METHOD: We proposed a new multi-classification machine learning pipeline, MultiLyGAN, to identify seven types of lysine modified sites. Using eight different sequential and five structural construction methods, 1497 valid features remained after filtering by Pearson correlation coefficient. To solve the data imbalance problem, Conditional Generative Adversarial Network (CGAN) and Conditional Wasserstein Generative Adversarial Network (CWGAN), two influential deep generative methods, were leveraged and compared to generate new samples for the types with fewer samples. Finally, a random forest algorithm was used to predict the seven categories. RESULTS: In tenfold cross-validation, accuracy (Acc) and Matthews correlation coefficient (MCC) were 0.8589 and 0.8376, respectively. In the independent test, Acc and MCC were 0.8549 and 0.8330, respectively. The results indicated that CWGAN better solved the existing data imbalance and stabilized the training error. In addition, an accumulated feature importance analysis reported that CKSAAP, PWM and structural features were the three most important feature-encoding schemes. MultiLyGAN can be found at https://github.com/Lab-Xu/MultiLyGAN. CONCLUSIONS: The CWGAN greatly improved the predictive performance in all experiments. Features derived from CKSAAP, PWM and structure schemes are the most informative and made the greatest contribution to the prediction of PTM. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04101-y.
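The abstract reports accuracy (Acc) and the Matthews correlation coefficient (MCC) for a seven-class problem. As a minimal sketch of how those two numbers can be computed from true and predicted labels, the block below uses the standard multiclass generalization of MCC based on class counts. This is an assumption: the record does not state which multiclass MCC formula the authors used, and the function names `accuracy` and `multiclass_mcc` are hypothetical, not taken from the MultiLyGAN repository.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from label counts (assumed formula, see lead-in).

    With s samples, c correct predictions, t_k true occurrences of class k and
    p_k predictions of class k:
        MCC = (c*s - sum_k t_k*p_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    For two classes this reduces to the usual binary MCC.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_counts = Counter(y_true)
    p_counts = Counter(y_pred)
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    var_true = s * s - sum(v * v for v in t_counts.values())
    var_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(var_true * var_pred)
    # By convention, return 0.0 when either label set is constant.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy labels only; these are not data from the paper.
    truth = [0, 1, 2, 0, 1, 2]
    guess = [0, 1, 2, 0, 2, 2]
    print(accuracy(truth, guess), multiclass_mcc(truth, guess))
```

Unlike accuracy, MCC takes the per-class prediction counts into account, which is why it is often reported alongside Acc when class sizes are imbalanced, as in the setting the abstract describes.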
format Online
Article
Text
id pubmed-8010967
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8010967 2021-03-31 Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks Yang, Yingxi Wang, Hui Li, Wen Wang, Xiaobo Wei, Shizhao Liu, Yulong Xu, Yan BMC Bioinformatics Research Article BioMed Central 2021-03-31 /pmc/articles/PMC8010967/ /pubmed/33789579 http://dx.doi.org/10.1186/s12859-021-04101-y Text en © The Author(s) 2021 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Prediction and analysis of multiple protein lysine modified sites based on conditional wasserstein generative adversarial networks
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8010967/
https://www.ncbi.nlm.nih.gov/pubmed/33789579
http://dx.doi.org/10.1186/s12859-021-04101-y