Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

Because DNA-binding proteins play vital roles in gene regulation, an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we proposed a new method for the prediction of the DNA-...

Bibliographic Details
Main Authors:	Lou, Wangchao; Wang, Xiaoqing; Chen, Fan; Chen, Yixiao; Jiang, Bo; Zhang, Hua
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2014
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901691/ https://www.ncbi.nlm.nih.gov/pubmed/24475169 http://dx.doi.org/10.1371/journal.pone.0086703

author	Lou, Wangchao; Wang, Xiaoqing; Chen, Fan; Chen, Yixiao; Jiang, Bo; Zhang, Hua
author_sort	Lou, Wangchao
collection	PubMed
description	Because DNA-binding proteins play vital roles in gene regulation, an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we proposed a new method for the prediction of DNA-binding proteins that ranks features using random forest and then performs wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 in five-fold cross-validation with ten runs on the training benchmark dataset PDB594. Blind tests on the independent dataset PDB186 were then performed with the proposed model, trained on the entire PDB594 dataset, and with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. Independent tests of DBPPred on a large dataset composed entirely of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, we observed that for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins. All of the experimental results indicate that the proposed DBPPred can serve as an alternative predictor for large-scale determination of DNA-binding proteins.
format	Online Article Text
id	pubmed-3901691
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3901691 2014-01-28 Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes Lou, Wangchao; Wang, Xiaoqing; Chen, Fan; Chen, Yixiao; Jiang, Bo; Zhang, Hua PLoS One Research Article Public Library of Science 2014-01-24 /pmc/articles/PMC3901691/ /pubmed/24475169 http://dx.doi.org/10.1371/journal.pone.0086703 Text en © 2014 Lou et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3901691/ https://www.ncbi.nlm.nih.gov/pubmed/24475169 http://dx.doi.org/10.1371/journal.pone.0086703
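
The description field above reports results as accuracy and MCC (Matthews correlation coefficient). As a reader's aid, the following is a minimal sketch of how these two metrics are computed from binary confusion counts. It is not the authors' code: the function names and the toy labels are hypothetical, and 1 is assumed to mean "DNA-binding".

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels; 1 = DNA-binding (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented for illustration only (not data from the article).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
print(counts, acc_value, mcc_value)
```

Unlike accuracy, MCC uses all four confusion counts in both numerator and denominator, so it stays informative when the two classes are of unequal size.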

Similar Items