SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition

BACKGROUND: Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers....

Bibliographic Details
Main authors: Melvin, Iain, Ie, Eugene, Kuang, Rui, Weston, Jason, Stafford, William Noble, Leslie, Christina
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1892081/
https://www.ncbi.nlm.nih.gov/pubmed/17570145
http://dx.doi.org/10.1186/1471-2105-8-S4-S2
author Melvin, Iain
Ie, Eugene
Kuang, Rui
Weston, Jason
Stafford, William Noble
Leslie, Christina
collection PubMed
description BACKGROUND: Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. RESULTS: We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. CONCLUSION: By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition.
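Illustrative note (not part of the catalogued record): the two ideas named in the description can be sketched in a few lines of Python. This is a simplified sketch under stated assumptions, not the SVM-Fold implementation. The first part builds a histogram of inexact-matching k-mers directly from a raw sequence, whereas the profile kernel described above works on sequence profiles. The second part combines one-vs-the-rest scores using per-classifier weights and a per-class "code" listing the fold-level and superfamily-level detectors that apply to that class. All function names, detector labels, scores, and weights here are hypothetical, and the weights are supplied by hand rather than learned.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def neighbors(kmer, max_mismatches):
    """Yield every k-mer within `max_mismatches` substitutions of `kmer`."""
    if max_mismatches == 0 or not kmer:
        yield kmer
        return
    head, tail = kmer[0], kmer[1:]
    for letter in AMINO_ACIDS:
        # A substitution at this position uses up one unit of the budget.
        budget = max_mismatches - (letter != head)
        for rest in neighbors(tail, budget):
            yield letter + rest


def mismatch_histogram(seq, k=3, max_mismatches=1):
    """Histogram of inexact-matching k-mers (simplified: no profiles)."""
    hist = Counter()
    for i in range(len(seq) - k + 1):
        hist.update(neighbors(seq[i:i + k], max_mismatches))
    return hist


def mismatch_kernel(a, b, k=3, max_mismatches=1):
    """Kernel value = dot product of the two k-mer histograms."""
    ha = mismatch_histogram(a, k, max_mismatches)
    hb = mismatch_histogram(b, k, max_mismatches)
    return sum(count * hb[kmer] for kmer, count in ha.items())


def predict_class(scores, weights, codes):
    """Pick the class whose weighted detector scores sum highest.

    scores  : detector name -> one-vs-the-rest SVM output for one query
    weights : detector name -> relative weight (hand-set here, assumed learned)
    codes   : class label   -> detectors in that class's code, e.g. its
              fold-level and superfamily-level classifiers
    """
    def combined(label):
        return sum(weights[d] * scores[d] for d in codes[label])
    return max(codes, key=combined)


if __name__ == "__main__":
    # Hypothetical detectors and scores, for illustration only.
    codes = {"a.1": ("fold:a", "sf:a.1"), "b.1": ("fold:b", "sf:b.1")}
    scores = {"fold:a": 1.5, "sf:a.1": 0.2, "fold:b": 0.4, "sf:b.1": 0.9}
    uniform = dict.fromkeys(scores, 1.0)
    reweighted = {**uniform, "fold:a": 0.2}
    print(predict_class(scores, uniform, codes))
    print(predict_class(scores, reweighted, codes))
    print(mismatch_kernel("ACD", "ACE"))
```

With uniform weights the sketch behaves like plain score summation over each class's detectors; down-weighting an over-confident detector can change the predicted class, which is the effect the description attributes to learning relative weights between classifiers.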
format Text
id pubmed-1892081
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1892081 2007-06-15 SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition Melvin, Iain Ie, Eugene Kuang, Rui Weston, Jason Stafford, William Noble Leslie, Christina BMC Bioinformatics Proceedings BioMed Central 2007-05-22 /pmc/articles/PMC1892081/ /pubmed/17570145 http://dx.doi.org/10.1186/1471-2105-8-S4-S2 Text en Copyright © 2007 Melvin et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1892081/
https://www.ncbi.nlm.nih.gov/pubmed/17570145
http://dx.doi.org/10.1186/1471-2105-8-S4-S2