
Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model

Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional biological experiment methods for predicting SIPs have been developed in the past few years. However, these methods are costly, time-consuming and inefficient, and often limit thei...


Bibliographic Details
Main authors: Chen, Zhan-Heng, You, Zhu-Hong, Zhang, Wen-Bo, Wang, Yan-Bin, Cheng, Li, Alghazzawi, Daniyal
Format: Online Article Text
Language: English
Published: MDPI 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6896115/
https://www.ncbi.nlm.nih.gov/pubmed/31726752
http://dx.doi.org/10.3390/genes10110924
_version_ 1783476709818892288
author Chen, Zhan-Heng
You, Zhu-Hong
Zhang, Wen-Bo
Wang, Yan-Bin
Cheng, Li
Alghazzawi, Daniyal
author_facet Chen, Zhan-Heng
You, Zhu-Hong
Zhang, Wen-Bo
Wang, Yan-Bin
Cheng, Li
Alghazzawi, Daniyal
author_sort Chen, Zhan-Heng
collection PubMed
description Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional biological experiment methods for predicting SIPs have been developed in the past few years. However, these methods are costly, time-consuming and inefficient, which often limits their usage for predicting SIPs. Therefore, computational methods have emerged to meet this need. In this paper, we propose, for the first time, a novel deep learning model combined with a natural language processing (NLP) method for predicting potential SIPs from protein sequence information. More specifically, the protein sequence is de novo assembled by k-mers. Then, we obtain the global vectors representation of each protein sequence by using an NLP technique. Finally, based on the knowledge of known self-interacting and non-interacting proteins, a multi-grained cascade forest model is trained to predict SIPs. Comprehensive experiments were performed on yeast and human datasets, which obtained accuracy rates of 91.45% and 93.12%, respectively. From our evaluations, the experimental results show that the use of amino acid semantic information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work has potential applications for various biological classification problems.
format Online
Article
Text
id pubmed-6896115
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-68961152019-12-23 Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model Chen, Zhan-Heng You, Zhu-Hong Zhang, Wen-Bo Wang, Yan-Bin Cheng, Li Alghazzawi, Daniyal Genes (Basel) Article Self-interacting proteins (SIPs) are of paramount importance in current molecular biology. A number of traditional biological experiment methods for predicting SIPs have been developed in the past few years. However, these methods are costly, time-consuming and inefficient, which often limits their usage for predicting SIPs. Therefore, computational methods have emerged to meet this need. In this paper, we propose, for the first time, a novel deep learning model combined with a natural language processing (NLP) method for predicting potential SIPs from protein sequence information. More specifically, the protein sequence is de novo assembled by k-mers. Then, we obtain the global vectors representation of each protein sequence by using an NLP technique. Finally, based on the knowledge of known self-interacting and non-interacting proteins, a multi-grained cascade forest model is trained to predict SIPs. Comprehensive experiments were performed on yeast and human datasets, which obtained accuracy rates of 91.45% and 93.12%, respectively. From our evaluations, the experimental results show that the use of amino acid semantic information is very helpful for addressing the problem of sequences containing both self-interacting and non-interacting pairs of proteins. This work has potential applications for various biological classification problems. MDPI 2019-11-12 /pmc/articles/PMC6896115/ /pubmed/31726752 http://dx.doi.org/10.3390/genes10110924 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Chen, Zhan-Heng
You, Zhu-Hong
Zhang, Wen-Bo
Wang, Yan-Bin
Cheng, Li
Alghazzawi, Daniyal
Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model
title Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model
title_full Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model
title_fullStr Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model
title_full_unstemmed Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model
title_short Global Vectors Representation of Protein Sequences and Its Application for Predicting Self-Interacting Proteins with Multi-Grained Cascade Forest Model
title_sort global vectors representation of protein sequences and its application for predicting self-interacting proteins with multi-grained cascade forest model
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6896115/
https://www.ncbi.nlm.nih.gov/pubmed/31726752
http://dx.doi.org/10.3390/genes10110924
work_keys_str_mv AT chenzhanheng globalvectorsrepresentationofproteinsequencesanditsapplicationforpredictingselfinteractingproteinswithmultigrainedcascadeforestmodel
AT youzhuhong globalvectorsrepresentationofproteinsequencesanditsapplicationforpredictingselfinteractingproteinswithmultigrainedcascadeforestmodel
AT zhangwenbo globalvectorsrepresentationofproteinsequencesanditsapplicationforpredictingselfinteractingproteinswithmultigrainedcascadeforestmodel
AT wangyanbin globalvectorsrepresentationofproteinsequencesanditsapplicationforpredictingselfinteractingproteinswithmultigrainedcascadeforestmodel
AT chengli globalvectorsrepresentationofproteinsequencesanditsapplicationforpredictingselfinteractingproteinswithmultigrainedcascadeforestmodel
AT alghazzawidaniyal globalvectorsrepresentationofproteinsequencesanditsapplicationforpredictingselfinteractingproteinswithmultigrainedcascadeforestmodel