Kernel-based machine learning protocol for predicting DNA-binding proteins

DNA-binding proteins (DNA-BPs) play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Attempts have been made to identify DNA-BPs based on their sequence and structural information with moderate accuracy. Here we develop a machine...

Bibliographic Details
Main Authors:	Bhardwaj, Nitin, Langlois, Robert E., Zhao, Guijun, Lu, Hui
Format:	Text
Language:	English
Published:	Oxford University Press 2005
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1283538/ https://www.ncbi.nlm.nih.gov/pubmed/16284202 http://dx.doi.org/10.1093/nar/gki949

_version_	1782126153847275520
author	Bhardwaj, Nitin Langlois, Robert E. Zhao, Guijun Lu, Hui
author_facet	Bhardwaj, Nitin Langlois, Robert E. Zhao, Guijun Lu, Hui
author_sort	Bhardwaj, Nitin
collection	PubMed
description	DNA-binding proteins (DNA-BPs) play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Attempts have been made to identify DNA-BPs based on their sequence and structural information with moderate accuracy. Here we develop a machine learning protocol for the prediction of DNA-BPs in which the classifier is a Support Vector Machine (SVM). Information used for classification is derived from characteristics that include surface and overall composition, overall charge and positive potential patches on the protein surface. In total, 121 DNA-BPs and 238 non-binding proteins are used to build and evaluate the protocol. In the self-consistency test, an accuracy of 100% is achieved. For cross-validation (CV) optimization over the entire dataset, we report an accuracy of 90%. Using leave-1-pair holdout evaluation, an accuracy of 86.3% is achieved. When we restrict the dataset to less than 20% sequence identity amongst the proteins, the holdout accuracy is 85.8%. Furthermore, seven DNA-BPs with unbound structures are all correctly predicted. The current performance is better than previously published results. The higher accuracy achieved here originates from two factors: the ability of the SVM to handle features that demonstrate a wide range of discriminatory power, and a different definition of the positive patch. Since our protocol does not rely on sequence or structural homology, it can be used to identify or predict proteins with DNA-binding function(s) regardless of their homology to the known ones.
format	Text
id	pubmed-1283538
institution	National Center for Biotechnology Information
language	English
publishDate	2005
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-12835382005-11-16 Kernel-based machine learning protocol for predicting DNA-binding proteins Bhardwaj, Nitin Langlois, Robert E. Zhao, Guijun Lu, Hui Nucleic Acids Res Article DNA-binding proteins (DNA-BPs) play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Attempts have been made to identify DNA-BPs based on their sequence and structural information with moderate accuracy. Here we develop a machine learning protocol for the prediction of DNA-BPs where the classifier is Support Vector Machines (SVMs). Information used for classification is derived from characteristics that include surface and overall composition, overall charge and positive potential patches on the protein surface. In total 121 DNA-BPs and 238 non-binding proteins are used to build and evaluate the protocol. In self-consistency, accuracy value of 100% has been achieved. For cross-validation (CV) optimization over entire dataset, we report an accuracy of 90%. Using leave 1-pair holdout evaluation, the accuracy of 86.3% has been achieved. When we restrict the dataset to less than 20% sequence identity amongst the proteins, the holdout accuracy is achieved at 85.8%. Furthermore, seven DNA-BPs with unbounded structures are all correctly predicted. The current performances are better than results published previously. The higher accuracy value achieved here originates from two factors: the ability of the SVM to handle features that demonstrate a wide range of discriminatory power and, a different definition of the positive patch. Since our protocol does not lean on sequence or structural homology, it can be used to identify or predict proteins with DNA-binding function(s) regardless of their homology to the known ones. Oxford University Press 2005 2005-11-10 /pmc/articles/PMC1283538/ /pubmed/16284202 http://dx.doi.org/10.1093/nar/gki949 Text en © The Author 2005. Published by Oxford University Press. 
All rights reserved
spellingShingle	Article Bhardwaj, Nitin Langlois, Robert E. Zhao, Guijun Lu, Hui Kernel-based machine learning protocol for predicting DNA-binding proteins
title	Kernel-based machine learning protocol for predicting DNA-binding proteins
title_full	Kernel-based machine learning protocol for predicting DNA-binding proteins
title_fullStr	Kernel-based machine learning protocol for predicting DNA-binding proteins
title_full_unstemmed	Kernel-based machine learning protocol for predicting DNA-binding proteins
title_short	Kernel-based machine learning protocol for predicting DNA-binding proteins
title_sort	kernel-based machine learning protocol for predicting dna-binding proteins
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1283538/ https://www.ncbi.nlm.nih.gov/pubmed/16284202 http://dx.doi.org/10.1093/nar/gki949
work_keys_str_mv	AT bhardwajnitin kernelbasedmachinelearningprotocolforpredictingdnabindingproteins AT langloisroberte kernelbasedmachinelearningprotocolforpredictingdnabindingproteins AT zhaoguijun kernelbasedmachinelearningprotocolforpredictingdnabindingproteins AT luhui kernelbasedmachinelearningprotocolforpredictingdnabindingproteins
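
The description field above lists overall composition and overall charge among the characteristics fed to the classifier. As a rough illustration of what such features can look like, the following minimal Python sketch builds a 21-value vector from an amino-acid sequence. It is not the authors' implementation: the function names, the 20-letter alphabet ordering, and the charge convention (K and R counted as +1, D and E as -1, histidine ignored) are all assumptions made for this example. The surface composition and positive-patch features mentioned in the record require a 3D structure and are not covered here.

```python
from collections import Counter

# Assumption: the 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: a crude charge convention (Lys/Arg = +1, Asp/Glu = -1).
# Histidine and terminal groups are ignored in this sketch.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def composition(seq):
    """Return the fraction of each standard amino acid in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    # Non-standard letters still count toward the length but get no entry.
    return {aa: counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def net_charge(seq):
    """Return the integer net charge under the convention in CHARGE."""
    return sum(CHARGE.get(aa, 0) for aa in seq.upper())


def feature_vector(seq):
    """Concatenate the 20 composition fractions with the net charge."""
    comp = composition(seq)
    return [comp[aa] for aa in AMINO_ACIDS] + [float(net_charge(seq))]


if __name__ == "__main__":
    # Hypothetical toy sequence, chosen only to exercise the functions.
    print(feature_vector("KKRDE"))
```

A vector like this could then be passed, one row per protein, to any SVM implementation; which kernel and parameters the original study used is not stated in this record.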
