Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy

BACKGROUND: Understanding how amino acid substitutions affect protein functions is critical for the study of proteins and their implications in diseases. Although methods have been developed for predicting potential effects of amino acid substitutions using sequence, three-dimensional structural, an...

Bibliographic Details
Main Authors: Jiang, Rui, Yang, Hua, Sun, Fengzhu, Chen, Ting
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618409/
https://www.ncbi.nlm.nih.gov/pubmed/16984653
http://dx.doi.org/10.1186/1471-2105-7-417
author Jiang, Rui
Yang, Hua
Sun, Fengzhu
Chen, Ting
collection PubMed
description BACKGROUND: Understanding how amino acid substitutions affect protein functions is critical for the study of proteins and their implications in diseases. Although methods have been developed for predicting the potential effects of amino acid substitutions using sequence, three-dimensional structural, and evolutionary properties of proteins, their application is limited by the complexity of the features and by the availability of protein structural information. Another limitation is that the prediction results are hard to interpret in terms of physicochemical principles and biological knowledge. RESULTS: To overcome these limitations, we proposed a novel feature set using physicochemical properties of amino acids, evolutionary profiles of proteins, and protein sequence information. We applied the support vector machine and the random forest with the feature set to experimental amino acid substitutions occurring in the E. coli lac repressor and the bacteriophage T4 lysozyme, as well as to annotated amino acid substitutions occurring in a wide range of human proteins. The results showed that the proposed feature set was superior to the existing ones. To explore the physicochemical principles behind amino acid substitutions, we designed a simulated annealing bump hunting strategy to automatically extract interpretable rules for amino acid substitutions. We applied the strategy to annotated human amino acid substitutions and successfully extracted several rules that either were consistent with current biological knowledge or provided new insights into amino acid substitutions. When applied to unclassified data, these rules could cover a large portion of samples, and most of the covered samples showed good agreement with predictions made by either the support vector machine or the random forest.
CONCLUSION: The prediction methods using the proposed feature set can achieve larger AUC (the area under the ROC curve), smaller BER (the balanced error rate), and larger MCC (the Matthews' correlation coefficient) than those using the published feature sets, suggesting that our feature set is superior to the existing ones. The rules extracted by the simulated annealing bump hunting strategy have coverage and accuracy comparable to, but much better interpretability than, those extracted by the patient rule induction method (PRIM), indicating that the strategy is more effective in inducing interpretable rules.
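The abstract names BER and MCC without giving formulas. As a reader's aid, the following minimal sketch computes both from confusion-matrix counts using the standard textbook definitions; it is assumed, not confirmed by this record, that the article uses the same ones. The function names and the example counts are hypothetical and are not taken from the article.

```python
import math


def balanced_error_rate(tp, fn, tn, fp):
    """Average of the error rates on the positive and negative classes."""
    false_negative_rate = fn / (tp + fn)
    false_positive_rate = fp / (tn + fp)
    return 0.5 * (false_negative_rate + false_positive_rate)


def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews' correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only (not data from the article).
tp, fn, tn, fp = 80, 20, 90, 10
ber = balanced_error_rate(tp, fn, tn, fp)
mcc = matthews_corrcoef(tp, fn, tn, fp)
print(f"BER = {ber:.3f}, MCC = {mcc:.3f}")
```

A smaller BER and a larger MCC both indicate better classification, which matches the direction of the comparison stated in the conclusion.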
format Text
id pubmed-1618409
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1618409 2006-10-20 Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy Jiang, Rui Yang, Hua Sun, Fengzhu Chen, Ting BMC Bioinformatics Research Article BioMed Central 2006-09-19 /pmc/articles/PMC1618409/ /pubmed/16984653 http://dx.doi.org/10.1186/1471-2105-7-417 Text en Copyright © 2006 Jiang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618409/
https://www.ncbi.nlm.nih.gov/pubmed/16984653
http://dx.doi.org/10.1186/1471-2105-7-417