Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation

AIM AND OBJECTIVE: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding protein sequences and...

Bibliographic Details
Main Authors:	Li, Chun; Zhao, Jialing; Wang, Changzhong; Yao, Yuhua
Format:	Online Article Text
Language:	English
Published:	Bentham Science Publishers 2018
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5930480/ https://www.ncbi.nlm.nih.gov/pubmed/29380690 http://dx.doi.org/10.2174/1386207321666180130100838

_version_	1783319503234400256
author	Li, Chun; Zhao, Jialing; Wang, Changzhong; Yao, Yuhua
collection	PubMed
description	AIM AND OBJECTIVE: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for encoding protein sequences in a timely manner and extracting the hidden information. METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. RESULTS: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC-based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods, with improvements of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. CONCLUSION: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
format	Online Article Text
id	pubmed-5930480
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Bentham Science Publishers
record_format	MEDLINE/PubMed
spelling	pubmed-5930480 2018-05-11 Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation Li, Chun Zhao, Jialing Wang, Changzhong Yao, Yuhua Comb Chem High Throughput Screen Article Bentham Science Publishers 2018-02 2018-02 /pmc/articles/PMC5930480/ /pubmed/29380690 http://dx.doi.org/10.2174/1386207321666180130100838 Text en © 2018 Bentham Science Publishers https://creativecommons.org/licenses/by-nc/4.0/legalcode This is an open access article licensed under the terms of the Creative Commons Attribution-Non-Commercial 4.0 International Public License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/legalcode), which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.
title	Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5930480/ https://www.ncbi.nlm.nih.gov/pubmed/29380690 http://dx.doi.org/10.2174/1386207321666180130100838
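
The description field reports classifier comparisons in terms of ACC, MCC, and F1M. As a reading aid, the following minimal sketch computes these three scores from binary confusion-matrix counts. It assumes that ACC is accuracy, MCC is the Matthews correlation coefficient, and F1M is the standard F1-measure; the record itself does not define them. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the article.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return (ACC, MCC, F1) for a binary classifier.

    tp, tn, fp, fn are confusion-matrix counts. The mapping of the
    record's "F1M" to the standard F1-measure is an assumption.
    """
    total = tp + tn + fp + fn
    # Accuracy: fraction of all samples classified correctly.
    acc = (tp + tn) / total

    # Matthews correlation coefficient; defined as 0 when any
    # marginal total is zero (a common convention).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # F1: harmonic mean of precision and recall.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    return acc, mcc, f1


# Hypothetical counts, for illustration only (not data from the article).
acc, mcc, f1 = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"ACC={acc:.4f} MCC={mcc:.4f} F1={f1:.4f}")
```

Note that ACC and F1 lie in [0, 1] and are often quoted as percentages, while MCC lies in [-1, 1], which is consistent with the record giving ACC and F1M differences in percent and MCC differences as plain decimals.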
