Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins
Protein–DNA complexes play vital roles in many cellular processes by the interactions of amino acids with DNA. Several computational methods have been developed for predicting the interacting residues in DNA-binding proteins using sequence and/or structural information. These methods showed differen...
Main authors: | Nagarajan, R.; Ahmad, Shandar; Michael Gromiha, M. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2013 |
Subjects: | Computational Biology |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3763535/ https://www.ncbi.nlm.nih.gov/pubmed/23788679 http://dx.doi.org/10.1093/nar/gkt544 |
_version_ | 1782283029784297472 |
---|---|
author | Nagarajan, R. Ahmad, Shandar Michael Gromiha, M. |
author_facet | Nagarajan, R. Ahmad, Shandar Michael Gromiha, M. |
author_sort | Nagarajan, R. |
collection | PubMed |
description | Protein–DNA complexes play vital roles in many cellular processes by the interactions of amino acids with DNA. Several computational methods have been developed for predicting the interacting residues in DNA-binding proteins using sequence and/or structural information. These methods showed different levels of accuracies, which may depend on the choice of data sets used in training, the feature sets selected for developing a predictive model, the ability of the models to capture information useful for prediction or a combination of these factors. In many cases, different methods are likely to produce similar results, whereas in others, the predictors may return contradictory predictions. In this situation, a priori estimates of prediction performance applicable to the system being investigated would be helpful for biologists to choose the best method for designing their experiments. In this work, we have constructed unbiased, stringent and diverse data sets for DNA-binding proteins based on various biologically relevant considerations: (i) seven structural classes, (ii) 86 folds, (iii) 106 superfamilies, (iv) 194 families, (v) 15 binding motifs, (vi) single/double-stranded DNA, (vii) DNA conformation (A, B, Z, etc.), (viii) three functions and (ix) disordered regions. These data sets were culled as non-redundant with sequence identities of 25 and 40% and used to evaluate the performance of 11 different methods in which online services or standalone programs are available. We observed that the best performing methods for each of the data sets showed significant biases toward the data sets selected for their benchmark. Our analysis revealed important data set features, which could be used to estimate these context-specific biases and hence suggest the best method to be used for a given problem. We have developed a web server, which considers these features on demand and displays the best method that the investigator should use. The web server is freely available at http://www.biotech.iitm.ac.in/DNA-protein/. Further, we have grouped the methods based on their complexity and analyzed the performance. The information gained in this work could be effectively used to select the best method for designing experiments. |
format | Online Article Text |
id | pubmed-3763535 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-3763535 2013-09-10 Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins Nagarajan, R. Ahmad, Shandar Michael Gromiha, M. Nucleic Acids Res Computational Biology Oxford University Press 2013-09 2013-06-20 /pmc/articles/PMC3763535/ /pubmed/23788679 http://dx.doi.org/10.1093/nar/gkt544 Text en © The Author(s) 2013. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Computational Biology Nagarajan, R. Ahmad, Shandar Michael Gromiha, M. Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins |
title | Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins |
title_sort | novel approach for selecting the best predictor for identifying the binding sites in dna binding proteins |
topic | Computational Biology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3763535/ https://www.ncbi.nlm.nih.gov/pubmed/23788679 http://dx.doi.org/10.1093/nar/gkt544 |