Predicting DNA-binding sites of proteins from amino acid sequence

Bibliographic Details
Main authors: Yan, Changhui; Terribilini, Michael; Wu, Feihong; Jernigan, Robert L; Dobbs, Drena; Honavar, Vasant
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1534068/
https://www.ncbi.nlm.nih.gov/pubmed/16712732
http://dx.doi.org/10.1186/1471-2105-7-262
author Yan, Changhui
Terribilini, Michael
Wu, Feihong
Jernigan, Robert L
Dobbs, Drena
Honavar, Vasant
collection PubMed
description BACKGROUND: Understanding the molecular details of protein-DNA interactions is critical for deciphering the mechanisms of gene regulation. We present a machine learning approach for the identification of amino acid residues involved in protein-DNA interactions.

RESULTS: We start with a Naïve Bayes classifier trained to predict whether a given amino acid residue is a DNA-binding residue based on its identity and the identities of its sequence neighbors. The input to the classifier consists of the identities of the target residue and 4 sequence neighbors on each side of the target residue. The classifier is trained and evaluated (using leave-one-out cross-validation) on a non-redundant set of 171 proteins. Our results indicate the feasibility of identifying interface residues based on local sequence information. The classifier achieves 71% overall accuracy with a correlation coefficient of 0.24, 35% specificity and 53% sensitivity in identifying interface residues as evaluated by leave-one-out cross-validation. We show that the performance of the classifier is improved by using sequence entropy of the target residue (the entropy of the corresponding column in multiple alignment obtained by aligning the target sequence with its sequence homologs) as additional input. The classifier achieves 78% overall accuracy with a correlation coefficient of 0.28, 44% specificity and 41% sensitivity in identifying interface residues. Examination of the predictions in the context of 3-dimensional structures of proteins demonstrates the effectiveness of this method in identifying DNA-binding sites from sequence information. In 33% (56 out of 171) of the proteins, the classifier identifies the interaction sites by correctly recognizing at least half of the interface residues. In 87% (149 out of 171) of the proteins, the classifier correctly identifies at least 20% of the interface residues. This suggests the possibility of using such classifiers to identify potential DNA-binding motifs and to gain potentially useful insights into sequence correlates of protein-DNA interactions.

CONCLUSION: Naïve Bayes classifiers trained to identify DNA-binding residues using sequence information offer a computationally efficient approach to identifying putative DNA-binding sites in DNA-binding proteins and recognizing potential DNA-binding motifs.
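The input encoding described in the abstract (the target residue plus 4 neighbors on each side, classified with Naïve Bayes, with column entropy as an extra signal) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the padding symbol, the add-one smoothing, the class name `WindowNaiveBayes`, and the toy sequence and labels are all assumptions made for illustration. The abstract does not say how the entropy value is fed into the classifier, so the sketch only computes it.

```python
import math
from collections import Counter, defaultdict

WINDOW = 4  # neighbors on each side of the target residue, as stated in the abstract
PAD = "-"   # hypothetical padding symbol for positions beyond the sequence ends


def windows(sequence, k=WINDOW):
    """Return one (2k+1)-residue window per position, centered on that position."""
    padded = PAD * k + sequence + PAD * k
    return [padded[i:i + 2 * k + 1] for i in range(len(sequence))]


class WindowNaiveBayes:
    """Categorical Naive Bayes over window positions (add-one smoothing is an assumption)."""

    def __init__(self, alphabet="ACDEFGHIKLMNPQRSTVWY" + PAD):
        self.alphabet = alphabet
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, position) -> residue counts

    def fit(self, window_list, labels):
        for window, label in zip(window_list, labels):
            self.class_counts[label] += 1
            for pos, residue in enumerate(window):
                self.feature_counts[(label, pos)][residue] += 1
        return self

    def log_posterior(self, window, label):
        """Unnormalized log posterior: log prior plus per-position log likelihoods."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for pos, residue in enumerate(window):
            counts = self.feature_counts[(label, pos)]
            numerator = counts[residue] + 1
            denominator = self.class_counts[label] + len(self.alphabet)
            score += math.log(numerator / denominator)
        return score

    def predict(self, window):
        return max(self.class_counts, key=lambda label: self.log_posterior(window, label))


def column_entropy(column):
    """Shannon entropy in bits of one alignment column, given as a string of residues."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy data, invented for illustration: 1 marks a DNA-binding residue, 0 a non-binding one.
toy_sequence = "MKRAG"
toy_labels = [0, 1, 1, 0, 0]
toy_windows = windows(toy_sequence)
model = WindowNaiveBayes().fit(toy_windows, toy_labels)
prediction = model.predict(toy_windows[1])
```

Leave-one-out evaluation over proteins, as in the abstract, would wrap `fit` and `predict` in a loop that holds out all windows of one protein at a time.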
format Text
id pubmed-1534068
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1534068 2006-08-10 Predicting DNA-binding sites of proteins from amino acid sequence Yan, Changhui Terribilini, Michael Wu, Feihong Jernigan, Robert L Dobbs, Drena Honavar, Vasant BMC Bioinformatics Research Article BioMed Central 2006-05-19 /pmc/articles/PMC1534068/ /pubmed/16712732 http://dx.doi.org/10.1186/1471-2105-7-262 Text en Copyright © 2006 Yan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Predicting DNA-binding sites of proteins from amino acid sequence
topic Research Article