Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks
BACKGROUND: The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, labori...
Main authors: | Wang, Yang; Li, Zhanchao; Zhang, Yanfei; Ma, Yingjun; Huang, Qixing; Chen, Xingyu; Dai, Zong; Zou, Xiaoyong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042949/ https://www.ncbi.nlm.nih.gov/pubmed/33845759 http://dx.doi.org/10.1186/s12859-021-04111-w |
_version_ | 1783678221958512640 |
---|---|
author | Wang, Yang Li, Zhanchao Zhang, Yanfei Ma, Yingjun Huang, Qixing Chen, Xingyu Dai, Zong Zou, Xiaoyong |
author_sort | Wang, Yang |
collection | PubMed |
description | BACKGROUND: The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, laborious, and time-consuming to determine protein–protein interactions (PPIs), and there is a strong demand for effective bioinformatics approaches to identify potential PPIs. Considering the large amount of PPI data, a high-performance processor can be utilized to enhance the capability of the deep learning method and predict PPIs directly from protein sequences. RESULTS: We propose the Sequence-Statistics-Content protein sequence encoding format (SSC) based on information extraction from the original sequence for further performance improvement of the convolutional neural network. The original protein sequences are encoded in the three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which can increase the unique sequence features to enhance the performance of the deep learning model. On predicting protein–protein interaction tasks, the results using the 2D convolutional neural network (2D CNN) with the SSC encoding method are better than those of the 1D CNN with one-hot encoding. The independent validation of new interactions from the HIPPIE database (version 2.1 published on July 18, 2017) and the validation of directly predicted results by applying a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model. CONCLUSION: The proposed protein sequence encoding method is efficient at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods. Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04111-w. |
format | Online Article Text |
id | pubmed-8042949 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8042949 2021-04-14 Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks Wang, Yang Li, Zhanchao Zhang, Yanfei Ma, Yingjun Huang, Qixing Chen, Xingyu Dai, Zong Zou, Xiaoyong BMC Bioinformatics Research Article BioMed Central 2021-04-12 /pmc/articles/PMC8042949/ /pubmed/33845759 http://dx.doi.org/10.1186/s12859-021-04111-w Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8042949/ https://www.ncbi.nlm.nih.gov/pubmed/33845759 http://dx.doi.org/10.1186/s12859-021-04111-w |
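
The description field above says that each protein sequence is encoded in three channels: the sequence itself, statistical information, and bigram information. The record does not define the channels precisely, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' SSC implementation (their code is at the GitHub URL given in the description). The function name `encode_three_channel`, the `max_len` padding parameter, and the exact content of the second and third channels are hypothetical choices made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_three_channel(seq, max_len):
    """Return a [3][max_len][20] nested list for one protein sequence.

    The channel layout is an illustrative assumption, not the published SSC spec:
      channel 0: one-hot residue identity
      channel 1: relative frequency of that residue within the sequence
      channel 2: one-hot identity of the following residue (a simple bigram signal)
    """
    seq = seq.upper()[:max_len]  # truncate; shorter sequences stay zero-padded
    counts = Counter(seq)
    channels = [
        [[0.0] * len(AMINO_ACIDS) for _ in range(max_len)] for _ in range(3)
    ]
    for pos, aa in enumerate(seq):
        col = INDEX.get(aa)
        if col is None:  # skip non-standard residues such as X or U
            continue
        channels[0][pos][col] = 1.0
        channels[1][pos][col] = counts[aa] / len(seq)
        if pos + 1 < len(seq):
            nxt = INDEX.get(seq[pos + 1])
            if nxt is not None:
                channels[2][pos][nxt] = 1.0
    return channels


encoded = encode_three_channel("MKTAYIAK", max_len=10)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # prints: 3 10 20
```

Stacking the channels this way gives an image-like tensor (channels × positions × residues), which is the kind of input a 2D CNN expects; a plain one-hot matrix corresponds to the first channel alone.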