
Enhanced protein fold recognition through a novel data integration approach

BACKGROUND: Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources which use physicochemical and structural properties as well as further data sources derived from local sequence alignments. This raises the is...


Bibliographic Details
Main authors: Ying, Yiming, Huang, Kaizhu, Campbell, Colin
Format: Text
Language: English
Published: BioMed Central 2009
Subjects: Research article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761901/
https://www.ncbi.nlm.nih.gov/pubmed/19709406
http://dx.doi.org/10.1186/1471-2105-10-267
_version_ 1782172869086674944
author Ying, Yiming
Huang, Kaizhu
Campbell, Colin
author_facet Ying, Yiming
Huang, Kaizhu
Campbell, Colin
author_sort Ying, Yiming
collection PubMed
description BACKGROUND: Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources which use physicochemical and structural properties as well as further data sources derived from local sequence alignments. This raises the issue of finding the most efficient method for combining these different informative data sources and exploring their relative significance for protein fold classification. Kernel methods have been extensively used for biological data analysis. They can incorporate separate fold discriminatory features into kernel matrices which encode the similarity between samples in their respective data sources. RESULTS: In this paper we consider the problem of integrating multiple data sources using a kernel-based approach. We propose a novel information-theoretic approach based on a Kullback-Leibler (KL) divergence between the output kernel matrix and the input kernel matrix so as to integrate heterogeneous data sources. One of the most appealing properties of this approach is that it can easily cope with multi-class classification and multi-task learning by an appropriate choice of the output kernel matrix. Based on the position of the output and input kernel matrices in the KL-divergence objective, there are two formulations which we respectively refer to as MKLdiv-dc and MKLdiv-conv. We propose to efficiently solve MKLdiv-dc by a difference of convex (DC) programming method and MKLdiv-conv by a projected gradient descent algorithm. The effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold recognition and a yeast protein function prediction problem. CONCLUSION: Our proposed methods MKLdiv-dc and MKLdiv-conv are able to achieve state-of-the-art performance on the SCOP PDB-40D benchmark dataset for protein fold prediction and provide useful insights into the relative significance of informative data sources. 
In particular, MKLdiv-dc further improves the fold discrimination accuracy to 75.19% which is a more than 5% improvement over competitive Bayesian probabilistic and SVM margin-based kernel learning methods. Furthermore, we report a competitive performance on the yeast protein function prediction problem.
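The abstract describes matching a weighted combination of input kernel matrices to an output kernel matrix through a KL divergence, and solving one of the two formulations by projected gradient descent. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each kernel matrix is treated as the covariance of a zero-mean Gaussian, that the combination weights lie on the probability simplex, and that a small ridge term keeps the matrices invertible. The function names (`gaussian_kl`, `combine`, `project_to_simplex`, `fit_weights`), the ridge value, the step-halving rule, and the choice of argument order (the one that is convex in the weights) are all illustrative assumptions, since the record does not give the exact objective.

```python
import numpy as np


def gaussian_kl(a, b):
    """KL( N(0, a) || N(0, b) ) for symmetric positive-definite a and b."""
    n = a.shape[0]
    _, logdet_a = np.linalg.slogdet(a)
    _, logdet_b = np.linalg.slogdet(b)
    trace_term = np.trace(np.linalg.solve(b, a))  # tr(b^{-1} a)
    return 0.5 * (trace_term - n + logdet_b - logdet_a)


def combine(kernels, weights, ridge=1e-3):
    """Weighted sum of kernel matrices plus a ridge (assumed regularizer)."""
    n = kernels[0].shape[0]
    return sum(w * k for w, k in zip(weights, kernels)) + ridge * np.eye(n)


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def fit_weights(kernels, k_out, steps=200, step_size=0.1, ridge=1e-3):
    """Projected gradient descent on KL( N(0, K_w) || N(0, k_out) ) over w."""
    m = len(kernels)
    w = np.full(m, 1.0 / m)  # start from uniform weights
    k_out_inv = np.linalg.inv(k_out)
    # Gradient of the trace term; it does not depend on w.
    lin = np.array([np.trace(k_out_inv @ k) for k in kernels])
    obj = gaussian_kl(combine(kernels, w, ridge), k_out)
    for _ in range(steps):
        k_w_inv = np.linalg.inv(combine(kernels, w, ridge))
        # d/dw_i of -logdet(K_w) is -tr(K_w^{-1} K_i).
        grad = 0.5 * (lin - np.array([np.trace(k_w_inv @ k) for k in kernels]))
        cand = project_to_simplex(w - step_size * grad)
        cand_obj = gaussian_kl(combine(kernels, cand, ridge), k_out)
        if cand_obj < obj:
            w, obj = cand, cand_obj  # accept only improving steps
        else:
            step_size *= 0.5  # otherwise shrink the step and retry
    return w, obj


# Toy usage: two random linear kernels and a label-based output kernel.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))
kernels = [x1 @ x1.T, x2 @ x2.T]
y = np.where(x1[:, 0] > 0, 1.0, -1.0)
k_out = np.outer(y, y) + 1e-3 * np.eye(8)
weights, objective = fit_weights(kernels, k_out)
print(weights, objective)
```

The learned weights can be read as a rough indication of how much each data source contributes to the combined kernel, which is the kind of insight into relative significance that the abstract mentions. With the arguments of the divergence swapped, the objective is a difference of two convex functions of the weights rather than a convex function, so plain projected gradient descent would no longer be the natural solver.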
format Text
id pubmed-2761901
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2761901 2009-10-15 Enhanced protein fold recognition through a novel data integration approach Ying, Yiming Huang, Kaizhu Campbell, Colin BMC Bioinformatics Research article BioMed Central 2009-08-26 /pmc/articles/PMC2761901/ /pubmed/19709406 http://dx.doi.org/10.1186/1471-2105-10-267 Text en Copyright ©2009 Ying et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research article
Ying, Yiming
Huang, Kaizhu
Campbell, Colin
Enhanced protein fold recognition through a novel data integration approach
title Enhanced protein fold recognition through a novel data integration approach
title_full Enhanced protein fold recognition through a novel data integration approach
title_fullStr Enhanced protein fold recognition through a novel data integration approach
title_full_unstemmed Enhanced protein fold recognition through a novel data integration approach
title_short Enhanced protein fold recognition through a novel data integration approach
title_sort enhanced protein fold recognition through a novel data integration approach
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761901/
https://www.ncbi.nlm.nih.gov/pubmed/19709406
http://dx.doi.org/10.1186/1471-2105-10-267
work_keys_str_mv AT yingyiming enhancedproteinfoldrecognitionthroughanoveldataintegrationapproach
AT huangkaizhu enhancedproteinfoldrecognitionthroughanoveldataintegrationapproach
AT campbellcolin enhancedproteinfoldrecognitionthroughanoveldataintegrationapproach