Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks

BACKGROUND: The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, labori...

Bibliographic Details
Main authors: Wang, Yang; Li, Zhanchao; Zhang, Yanfei; Ma, Yingjun; Huang, Qixing; Chen, Xingyu; Dai, Zong; Zou, Xiaoyong
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042949/
https://www.ncbi.nlm.nih.gov/pubmed/33845759
http://dx.doi.org/10.1186/s12859-021-04111-w
author Wang, Yang
Li, Zhanchao
Zhang, Yanfei
Ma, Yingjun
Huang, Qixing
Chen, Xingyu
Dai, Zong
Zou, Xiaoyong
author_sort Wang, Yang
collection PubMed
description BACKGROUND: Protein interactions are determined by protein sequences and affect the regulation of the cell cycle, signal transduction, and metabolism, which makes them highly significant to modern proteomics research. Despite advances in experimental technology, determining protein–protein interactions (PPIs) remains expensive, laborious, and time-consuming, and there is strong demand for effective bioinformatics approaches to identify potential PPIs. Given the large amount of PPI data, a high-performance processor can be used to enhance the capability of deep learning methods and to make predictions directly from protein sequences.
RESULTS: We propose the Sequence-Statistics-Content (SSC) protein sequence encoding format, which extracts information from the original sequence to further improve the performance of the convolutional neural network. The original protein sequences are encoded in a three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which adds distinctive sequence features and can enhance the performance of the deep learning model. On PPI prediction tasks, the 2D convolutional neural network (2D CNN) with SSC encoding gives better results than the 1D CNN with one-hot encoding. Independent validation on new interactions from the HIPPIE database (version 2.1, published on July 18, 2017) and validation of directly predicted results with a molecular docking tool indicate the effectiveness of the proposed encoding improvement in the CNN model.
CONCLUSION: The proposed protein sequence encoding method is effective at improving the capability of the CNN model on protein sequence-related tasks and may also enhance the capability of other machine learning or deep learning methods. Prediction accuracy and molecular docking validation showed considerable improvement over the existing one-hot encoding method, indicating that the SSC encoding method may be useful for protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04111-w.
format Online
Article
Text
id pubmed-8042949
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8042949 2021-04-14 Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks Wang, Yang; Li, Zhanchao; Zhang, Yanfei; Ma, Yingjun; Huang, Qixing; Chen, Xingyu; Dai, Zong; Zou, Xiaoyong BMC Bioinformatics Research Article BioMed Central 2021-04-12 /pmc/articles/PMC8042949/ /pubmed/33845759 http://dx.doi.org/10.1186/s12859-021-04111-w Text en
© The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks
topic Research Article