Logic minimization and rule extraction for identification of functional sites in molecular sequences


Bibliographic Details
Main Authors: Cruz-Cano, Raul; Lee, Mei-Ling Ting; Leung, Ming-Ying
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3492099/
https://www.ncbi.nlm.nih.gov/pubmed/22897894
http://dx.doi.org/10.1186/1756-0381-5-10
author Cruz-Cano, Raul
Lee, Mei-Ling Ting
Leung, Ming-Ying
collection PubMed
description BACKGROUND: Logic minimization is the application of algebraic axioms to a binary dataset with the purpose of reducing the number of digital variables and/or rules needed to express it. Although logic minimization techniques have been applied to bioinformatics datasets before, they have not been used in classification and rule discovery problems. In this paper, we propose a method based on logic minimization to extract predictive rules for two bioinformatics problems involving the identification of functional sites in molecular sequences: transcription factor binding sites (TFBS) in DNA and O-glycosylation sites in proteins. TFBS are important in various developmental processes, and glycosylation is a posttranslational modification critical to protein functions.
METHODS: In the present study, we first transformed the original biological dataset into a suitable binary form. Logic minimization was then applied to generate sets of simple rules to describe the transformed dataset. These rules were used to predict TFBS and O-glycosylation sites. The TFBS dataset was obtained from the TRANSFAC database, while the glycosylation dataset was compiled using information from OGLYCBASE and the Swiss-Prot Database. We performed the same predictions using two standard classification techniques, Artificial Neural Networks (ANN) and Support Vector Machines (SVM), and used their sensitivities and positive predictive values as benchmarks for the performance of our proposed algorithm. SVM were also used to reduce the number of variables included in the logic minimization approach.
RESULTS: For both TFBS and O-glycosylation sites, the prediction performance of the proposed logic minimization method was generally comparable to and, in some cases, superior to that of the standard ANN and SVM classification methods, with the advantage of providing intelligible rules to describe the datasets. In TFBS prediction, logic minimization produced a very small set of simple rules. In glycosylation site prediction, the rules produced were also interpretable, and the most popular rules generated appeared to correlate well with recently reported hydrophilic/hydrophobic enhancement values of amino acids around possible O-glycosylation sites. Experiments with Self-Organizing Neural Networks corroborate the practical worth of the logic minimization method for these case studies.
CONCLUSIONS: The proposed logic minimization algorithm provides sets of rules that can be used to predict TFBS and O-glycosylation sites with sensitivity and positive predictive value comparable to those from ANN and SVM. Moreover, the logic minimization method has the additional capability of generating interpretable rules that allow biological scientists to correlate the predictions with other experimental results and to form new hypotheses for further investigation. Additional experiments with alternative rule-extraction techniques demonstrate that the logic minimization method is able to produce accurate rules from datasets with large numbers of variables and limited numbers of positive examples.
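The pipeline summarized in the abstract (binary encoding, logic minimization into rules, evaluation by sensitivity and positive predictive value) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 2-bit nucleotide encoding, the function names, and the toy sequences are all assumptions made for illustration, and the minimization shown is only the textbook Quine-McCluskey merging step (x·y + x·y' = x) applied to positive examples, without don't-care terms or cover selection.

```python
from itertools import combinations

# Assumed 2-bit encoding; the abstract does not state the encoding actually used.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode(seq):
    """Turn a DNA string into a bit string, two bits per base."""
    return "".join(CODE[base] for base in seq)


def merge(a, b):
    """Merge two terms that differ in exactly one fixed bit, else return None."""
    diff = [i for i, (p, q) in enumerate(zip(a, b)) if p != q]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]  # "-" marks a variable that dropped out
    return None


def minimize(terms):
    """Merge terms repeatedly; return those that cannot be merged further."""
    terms = set(terms)
    primes = set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(sorted(terms), 2):
            m = merge(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        primes |= terms - used  # unmerged terms are prime implicants
        terms = merged
    return primes


def matches(rule, bits):
    """A rule fires when every fixed position agrees with the input bits."""
    return all(r in ("-", b) for r, b in zip(rule, bits))


def sensitivity_ppv(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); positive predictive value = TP / (TP + FP)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tp / (tp + fp)


# Toy positive examples (hypothetical, not taken from TRANSFAC).
positives = ["AC", "AT", "GC", "GT"]
rules = minimize(encode(s) for s in positives)

# Toy evaluation set with made-up labels.
test_seqs = ["AC", "GT", "CA", "AT"]
test_labels = [True, True, False, False]
preds = [any(matches(r, encode(s)) for r in rules) for s in test_seqs]
sens, ppv = sensitivity_ppv(test_labels, preds)
```

Under the assumed encoding, the four toy positives collapse to the single rule `-0-1`, which reads as "A or G followed by C or T". This shows how a minimized rule can remain interpretable in a way that ANN or SVM weights are not.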
format Online
Article
Text
id pubmed-3492099
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3492099 2012-11-09 Logic minimization and rule extraction for identification of functional sites in molecular sequences. Cruz-Cano, Raul; Lee, Mei-Ling Ting; Leung, Ming-Ying. BioData Min. Research. BioMed Central 2012-08-16 /pmc/articles/PMC3492099/ /pubmed/22897894 http://dx.doi.org/10.1186/1756-0381-5-10 Text en Copyright ©2012 Cruz-Cano et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Logic minimization and rule extraction for identification of functional sites in molecular sequences
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3492099/
https://www.ncbi.nlm.nih.gov/pubmed/22897894
http://dx.doi.org/10.1186/1756-0381-5-10