
Integrative approaches to the prediction of protein functions based on the feature selection

BACKGROUND: Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction....


Bibliographic Details
Main authors: Ko, Seokha; Lee, Hyunju
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813249/
https://www.ncbi.nlm.nih.gov/pubmed/20043848
http://dx.doi.org/10.1186/1471-2105-10-455
author Ko, Seokha
Lee, Hyunju
collection PubMed
description BACKGROUND: Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. However, it has also been observed that integrating more data sources does not always increase the prediction quality. Therefore, selecting data sources that highly contribute to the protein function prediction has become an important issue.
RESULTS: We present systematic feature selection methods that assess the contribution of genome-wide data sets to predict protein functions and then investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including: protein-domains, protein-protein interactions, gene expressions, phenotype ontology, phylogenetic profiles and disease data sources to predict protein functions that are labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using a kernel based logistic regression (KLR), and a kernel based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set for each function based on its prediction quality. In the second approach, we use the estimated coefficients of features as measures of contribution of data sources. Our results show that the proposed methods improve the prediction quality compared to the full integration of all data sources and other filter-based feature selection methods. We also show that contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among a group of protein functions that have the same parent in the GO hierarchy.
CONCLUSIONS: In contrast to previous integration methods, our approaches not only increase the prediction quality but also gather information about highly contributing data sources for each protein function. This information can help researchers collect relevant data sources for annotating protein functions.
format Text
id pubmed-2813249
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2813249 2010-01-29 Integrative approaches to the prediction of protein functions based on the feature selection Ko, Seokha Lee, Hyunju BMC Bioinformatics Methodology article BioMed Central 2009-12-31 /pmc/articles/PMC2813249/ /pubmed/20043848 http://dx.doi.org/10.1186/1471-2105-10-455 Text en Copyright ©2009 Ko and Lee; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Integrative approaches to the prediction of protein functions based on the feature selection
topic Methodology article
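
The second approach described in the abstract, using the estimated coefficients of an L1-norm regularized logistic regression as measures of data source contribution, can be illustrated with a minimal sketch. This is not the authors' KL1LR implementation: it uses a plain (non-kernel) logistic regression fitted by proximal gradient descent, and the data are synthetic. The source names, the penalty `lam`, the step size `lr`, and the function names are all assumptions made for illustration only.

```python
import numpy as np


def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrink each weight toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1_logistic(X, y, lam=0.05, lr=0.5, n_iter=1000):
    """L1-regularized logistic regression via proximal gradient descent.

    X: (n_samples, n_sources) array, one column per data source score.
    y: (n_samples,) array of 0/1 labels (protein has the function or not).
    Returns (weights, bias); the bias is not penalized.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad_w = X.T @ (p - y) / n               # gradient of the mean log-loss
        grad_b = np.mean(p - y)
        w = soft_threshold(w - lr * grad_w, lr * lam)
        b -= lr * grad_b
    return w, b


def demo():
    """Synthetic example: only source_A and source_C carry signal."""
    rng = np.random.default_rng(0)
    n = 400
    source_names = ["source_A", "source_B", "source_C", "source_D"]  # hypothetical
    X = rng.normal(size=(n, len(source_names)))
    logits = 2.0 * X[:, 0] - 1.5 * X[:, 2]
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)
    w, _ = fit_l1_logistic(X, y)
    return dict(zip(source_names, w))


if __name__ == "__main__":
    for name, coef in demo().items():
        print(f"{name}: {coef:+.3f}")
```

In this toy setting the L1 penalty shrinks the coefficients of the uninformative sources toward zero, so the magnitude of each remaining coefficient can be read as a rough contribution score for that source. The paper applies this idea per GO term and with kernels built from each genomic data set, which this sketch does not attempt to reproduce.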