
Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties

BACKGROUND: Existing methods for predicting DNA-binding proteins use features derived from physicochemical properties to design support vector machine (SVM) based classifiers. Generally, the selection of physicochemical properties and the determination of their corresponding feature vectors rely mainly on kn...


Bibliographic Details
Main authors: Huang, Hui-Lin, Lin, I-Che, Liou, Yi-Fan, Tsai, Chia-Ta, Hsu, Kai-Ti, Huang, Wen-Lin, Ho, Shinn-Jang, Ho, Shinn-Ying
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044304/
https://www.ncbi.nlm.nih.gov/pubmed/21342579
http://dx.doi.org/10.1186/1471-2105-12-S1-S47
_version_ 1782198716033138688
author Huang, Hui-Lin
Lin, I-Che
Liou, Yi-Fan
Tsai, Chia-Ta
Hsu, Kai-Ti
Huang, Wen-Lin
Ho, Shinn-Jang
Ho, Shinn-Ying
author_sort Huang, Hui-Lin
collection PubMed
description BACKGROUND: Existing methods for predicting DNA-binding proteins use features derived from physicochemical properties to design support vector machine (SVM) based classifiers. Generally, the selection of physicochemical properties and the determination of their corresponding feature vectors rely mainly on known properties of the binding mechanism and on the experience of the designers. However, designers face a troublesome problem: some different physicochemical properties have similar vectors representing the 20 amino acids, while some closely related physicochemical properties have dissimilar vectors. RESULTS: This study proposes a systematic approach (named Auto-IDPCPs) that automatically identifies a set of physicochemical and biochemical properties in the AAindex database for designing SVM-based classifiers that predict and analyze DNA-binding domains/proteins. Auto-IDPCPs consists of 1) clustering 531 amino acid indices in AAindex into 20 clusters using a fuzzy c-means algorithm, 2) using IBCGA, an efficient optimization method based on a genetic algorithm, to select an informative feature set of size m to represent sequences, and 3) analyzing the selected features to identify related physicochemical properties that may affect the binding mechanism of DNA-binding domains/proteins. Auto-IDPCPs identified m=22 features of properties belonging to five clusters for predicting DNA-binding domains, with a five-fold cross-validation accuracy of 87.12%, compared with 86.62% for the existing method PSSM-400. For predicting DNA-binding sequences, an accuracy of 75.50% was obtained using m=28 features, whereas PSSM-400 has an accuracy of 74.22%. On an independent test data set of DNA-binding domains, Auto-IDPCPs and PSSM-400 have accuracies of 80.73% and 82.81%, respectively.
Some typical physicochemical properties discovered are hydrophobicity, secondary structure, charge, solvent accessibility, polarity, flexibility, normalized van der Waals volume, and pK (pK-C, pK-N, pK-COOH and pK-a(RCOOH)). CONCLUSIONS: The proposed approach Auto-IDPCPs would help designers investigate informative physicochemical and biochemical properties by considering prediction accuracy and analysis of the binding mechanism simultaneously. Auto-IDPCPs can also be applied to predict and analyze other protein functions from sequences.
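The clustering step named in the abstract (fuzzy c-means over amino acid index vectors) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuzzy_c_means`, the Euclidean distance, the fuzzifier value of 2, the random initialization, and the toy vectors are all assumptions made for illustration. In the setting described above, each point would be a 20-dimensional AAindex vector and `c` would be 20.

```python
import math
import random


def fuzzy_c_means(points, c, fuzzifier=2.0, iters=100, tol=1e-6, seed=0):
    """Cluster `points` (equal-length lists of floats) into `c` fuzzy clusters.

    Returns (centers, memberships); memberships[i][k] is the degree to which
    point i belongs to cluster k, and each row sums to 1.
    Assumptions (not from the source): Euclidean distance, fuzzifier 2.0.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, normalized so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    exponent = 2.0 / (fuzzifier - 1.0)
    for _ in range(iters):
        # Update each center as a membership-weighted mean of the points.
        centers = []
        for k in range(c):
            weights = [u[i][k] ** fuzzifier for i in range(n)]
            wsum = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])
        # Update memberships from the distances to the new centers.
        new_u = []
        for p in points:
            dists = [math.dist(p, ctr) for ctr in centers]
            if any(d == 0.0 for d in dists):
                # Point sits exactly on a center: share membership among those.
                zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(zeros)
                new_u.append([z / total for z in zeros])
            else:
                new_u.append([
                    1.0 / sum((dk / dj) ** exponent for dj in dists)
                    for dk in dists
                ])
        shift = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if shift < tol:
            break
    return centers, u


# Toy stand-ins for index vectors (hypothetical values, not AAindex data).
toy_vectors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
               [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
toy_centers, toy_memberships = fuzzy_c_means(toy_vectors, c=2)
# Hard label per vector: the cluster with the largest membership degree.
toy_labels = [row.index(max(row)) for row in toy_memberships]
```

Unlike hard k-means, every vector keeps a graded membership in each cluster, which suits indices that sit between two property families; the hard labels are only a convenience for reading the result.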
format Text
id pubmed-3044304
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3044304 2011-02-25 Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties Huang, Hui-Lin Lin, I-Che Liou, Yi-Fan Tsai, Chia-Ta Hsu, Kai-Ti Huang, Wen-Lin Ho, Shinn-Jang Ho, Shinn-Ying BMC Bioinformatics Research BioMed Central 2011-02-15 /pmc/articles/PMC3044304/ /pubmed/21342579 http://dx.doi.org/10.1186/1471-2105-12-S1-S47 Text en Copyright ©2011 Huang et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3044304/
https://www.ncbi.nlm.nih.gov/pubmed/21342579
http://dx.doi.org/10.1186/1471-2105-12-S1-S47