
Prediction of DNA-binding residues from protein sequence information using random forests

BACKGROUND: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for o...


Bibliographic Details
Main Authors:	Wang, Liangjiang; Yang, Mary Qu; Yang, Jack Y
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709252/ https://www.ncbi.nlm.nih.gov/pubmed/19594868 http://dx.doi.org/10.1186/1471-2164-10-S1-S1

_version_	1782169283603726336
author	Wang, Liangjiang Yang, Mary Qu Yang, Jack Y
author_facet	Wang, Liangjiang Yang, Mary Qu Yang, Jack Y
author_sort	Wang, Liangjiang
collection	PubMed
description	BACKGROUND: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundreds of protein-DNA complexes. With the rapid accumulation of sequence data, it becomes an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data. RESULTS: A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labelled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures. CONCLUSION: The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies. A new web server called BindN-RF has thus been developed to make the RF classifier accessible to the biological research community.
format	Text
id	pubmed-2709252
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-27092522009-07-14 Prediction of DNA-binding residues from protein sequence information using random forests Wang, Liangjiang Yang, Mary Qu Yang, Jack Y BMC Genomics Research BACKGROUND: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundreds of protein-DNA complexes. With the rapid accumulation of sequence data, it becomes an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data. RESULTS: A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labelled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures. CONCLUSION: The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies. A new web server called BindN-RF has thus been developed to make the RF classifier accessible to the biological research community. 
BioMed Central 2009-07-07 /pmc/articles/PMC2709252/ /pubmed/19594868 http://dx.doi.org/10.1186/1471-2164-10-S1-S1 Text en Copyright © 2009 Wang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Wang, Liangjiang Yang, Mary Qu Yang, Jack Y Prediction of DNA-binding residues from protein sequence information using random forests
title	Prediction of DNA-binding residues from protein sequence information using random forests
title_full	Prediction of DNA-binding residues from protein sequence information using random forests
title_fullStr	Prediction of DNA-binding residues from protein sequence information using random forests
title_full_unstemmed	Prediction of DNA-binding residues from protein sequence information using random forests
title_short	Prediction of DNA-binding residues from protein sequence information using random forests
title_sort	prediction of dna-binding residues from protein sequence information using random forests
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709252/ https://www.ncbi.nlm.nih.gov/pubmed/19594868 http://dx.doi.org/10.1186/1471-2164-10-S1-S1
work_keys_str_mv	AT wangliangjiang predictionofdnabindingresiduesfromproteinsequenceinformationusingrandomforests AT yangmaryqu predictionofdnabindingresiduesfromproteinsequenceinformationusingrandomforests AT yangjacky predictionofdnabindingresiduesfromproteinsequenceinformationusingrandomforests
