
Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest...


Bibliographic Details
Main Authors:	Wu, Jiansheng, Liu, Hongde, Duan, Xueye, Ding, Yan, Wu, Hongtao, Bai, Yunfei, Sun, Xiao
Format:	Text
Language:	English
Published:	Oxford University Press 2009
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638931/ https://www.ncbi.nlm.nih.gov/pubmed/19008251 http://dx.doi.org/10.1093/bioinformatics/btn583

author	Wu, Jiansheng Liu, Hongde Duan, Xueye Ding, Yan Wu, Hongtao Bai, Yunfei Sun, Xiao
author_sort	Wu, Jiansheng
collection	PubMed
description	Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and has robust performance for different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information which reflects the characteristics of 20 kinds of amino acids for two physical–chemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class. Results: The results show that the RF model achieves 91.41% overall accuracy with a Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein–DNA interactions. Availability: DBindR web server implementation is freely available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm. Contact: xsun@seu.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online.
format	Text
id	pubmed-2638931
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-2638931 2009-02-25 Bioinformatics Original Papers
Oxford University Press 2009-01-01 2008-11-12 /pmc/articles/PMC2638931/ /pubmed/19008251 http://dx.doi.org/10.1093/bioinformatics/btn583 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2638931/ https://www.ncbi.nlm.nih.gov/pubmed/19008251 http://dx.doi.org/10.1093/bioinformatics/btn583
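The description in this record mentions two ideas that a short sketch may make more concrete: downsizing the majority class to handle imbalanced binding/non-binding residues, and reporting the Matthews correlation coefficient alongside overall accuracy. The Python below is only an illustration under stated assumptions. Plain random undersampling stands in for the authors' scheme, whose details are not given in this record, and the function names (`downsize_majority`, `accuracy`, `mcc`) and the toy counts are hypothetical, not taken from the paper.

```python
import math
import random


def downsize_majority(samples, labels, seed=0):
    """Randomly undersample the majority class to the minority class size.

    Assumption: labels are 1 (binding) or 0 (non-binding). This is generic
    random undersampling, not necessarily the scheme used in the paper.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority example plus an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def accuracy(tp, tn, fp, fn):
    """Fraction of residues classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy data: 3 binding residues among 12 (hypothetical numbers).
    labels = [1] * 3 + [0] * 9
    samples = list(range(12))
    xs, ys = downsize_majority(samples, labels)
    print(len(xs), sum(ys))

    # A classifier that always predicts "non-binding" on a 10%/90% split
    # looks accurate but carries no information according to MCC.
    print(accuracy(tp=0, tn=90, fp=0, fn=10), mcc(tp=0, tn=90, fp=0, fn=10))
```

The second call shows why accuracy alone can mislead on unbalanced data: always predicting the majority class gives high accuracy while the Matthews correlation coefficient stays at zero.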
