Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree

Nowadays a number of computational approaches have been developed to effectively and accurately predict protein interactions. However, most of these methods typically perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhoods informa...

Bibliographic Details
Main authors: Zhou, Chang, Yu, Hua, Ding, Yijie, Guo, Fei, Gong, Xiu-Jun
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5549711/
https://www.ncbi.nlm.nih.gov/pubmed/28792503
http://dx.doi.org/10.1371/journal.pone.0181426
author Zhou, Chang
Yu, Hua
Ding, Yijie
Guo, Fei
Gong, Xiu-Jun
collection PubMed
description A number of computational approaches have been developed to predict protein interactions effectively and accurately. However, most of these methods perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhood information) are not available. In the present work, we propose a method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids. A protein sequence is encoded at multiple scales by seven amino acid properties, including their qualitative and quantitative descriptions. Five kinds of protein descriptors (frequency, composition, transformation, distribution, and auto covariance) are extracted from these encodings to represent each protein sequence. The newly formed feature representation, consisting of 347 dimensions, captures not only the compositional and positional information of amino acids in the sequence but also their statistical significance. Based on this feature representation, the gradient boosting decision tree algorithm is used to predict the protein interaction class. When the proposed method is tested on the PPI data of S. cerevisiae, it achieves a prediction accuracy of 95.28% with a Matthews correlation coefficient of 90.68%. Compared with state-of-the-art work on H. pylori and Human, the accuracies are raised to 89.27% and 98.00%, respectively. Extensive experiments are performed on a crossover protein-protein interaction network, and the prediction accuracies are also very promising. Because of the learning capabilities of the gradient boosting decision tree and the multi-scale feature representation scheme, the proposed method might be a useful tool for future proteomics studies.
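The descriptors named in the abstract can be illustrated with a minimal Python sketch. This record does not list the seven properties or their groupings, so everything concrete here is an assumption: the three-class hydrophobicity grouping, the use of the Kyte-Doolittle hydropathy scale as the quantitative property, the reading of "transformation" as class-to-class transition frequency, and all function names. It shows one common way to compute composition, transition, and auto covariance features from a sequence, not the authors' implementation.

```python
from collections import Counter

# Assumed three-class hydrophobicity grouping (illustrative, not from this record).
GROUPS = {"polar": "RKEDQN", "neutral": "GASTPHY", "hydrophobic": "CLVIMFW"}
CLASS_OF = {aa: idx for idx, members in enumerate(GROUPS.values()) for aa in members}

# Kyte-Doolittle hydropathy values, used here as an assumed quantitative property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_classes(sequence):
    """Qualitative encoding: map each residue to its class index."""
    return [CLASS_OF[aa] for aa in sequence]


def composition(classes, n_classes=3):
    """Fraction of residues falling in each class."""
    counts = Counter(classes)
    return [counts[c] / len(classes) for c in range(n_classes)]


def transition(classes, n_classes=3):
    """Fraction of adjacent residue pairs that switch between two classes."""
    switches = Counter(
        frozenset((a, b)) for a, b in zip(classes, classes[1:]) if a != b
    )
    total_pairs = len(classes) - 1
    return [
        switches[frozenset((i, j))] / total_pairs
        for i in range(n_classes)
        for j in range(i + 1, n_classes)
    ]


def auto_covariance(values, max_lag):
    """Mean-centered auto covariance of a numeric encoding for lags 1..max_lag."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def describe(sequence, max_lag=2):
    """Concatenate the three descriptor groups into one feature vector."""
    classes = encode_classes(sequence)
    hydropathy = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return composition(classes) + transition(classes) + auto_covariance(hydropathy, max_lag)


if __name__ == "__main__":
    # Toy sequence, chosen only to exercise the functions.
    print(describe("MKRAACLLVE"))
```

A feature vector for a protein pair could then be formed by concatenating the vectors of the two sequences and passed to any gradient boosting classifier; how the 347 dimensions reported in the abstract are assembled is not specified in this record.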
format Online
Article
Text
id pubmed-5549711
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5549711 2017-08-12 Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree Zhou, Chang Yu, Hua Ding, Yijie Guo, Fei Gong, Xiu-Jun PLoS One Research Article Public Library of Science 2017-08-08 /pmc/articles/PMC5549711/ /pubmed/28792503 http://dx.doi.org/10.1371/journal.pone.0181426 Text en © 2017 Zhou et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree
topic Research Article