
Prediction of DNA-binding residues from protein sequence information using random forests

BACKGROUND: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for o...


Bibliographic Details
Main authors: Wang, Liangjiang; Yang, Mary Qu; Yang, Jack Y
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709252/
https://www.ncbi.nlm.nih.gov/pubmed/19594868
http://dx.doi.org/10.1186/1471-2164-10-S1-S1
author Wang, Liangjiang
Yang, Mary Qu
Yang, Jack Y
collection PubMed
description BACKGROUND: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundreds of protein-DNA complexes. With the rapid accumulation of sequence data, it becomes an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data. RESULTS: A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labelled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures. CONCLUSION: The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies. A new web server called BindN-RF has thus been developed to make the RF classifier accessible to the biological research community.
format Text
id pubmed-2709252
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2709252 2009-07-14 Prediction of DNA-binding residues from protein sequence information using random forests Wang, Liangjiang Yang, Mary Qu Yang, Jack Y BMC Genomics Research
BioMed Central 2009-07-07 /pmc/articles/PMC2709252/ /pubmed/19594868 http://dx.doi.org/10.1186/1471-2164-10-S1-S1 Text en Copyright © 2009 Wang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Prediction of DNA-binding residues from protein sequence information using random forests
topic Research