Integrative approaches to the prediction of protein functions based on the feature selection

BACKGROUND: Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction....

Bibliographic Details
Main Authors:	Ko, Seokha; Lee, Hyunju
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Methodology article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813249/ https://www.ncbi.nlm.nih.gov/pubmed/20043848 http://dx.doi.org/10.1186/1471-2105-10-455

_version_	1782176901066915840
author	Ko, Seokha Lee, Hyunju
author_facet	Ko, Seokha Lee, Hyunju
author_sort	Ko, Seokha
collection	PubMed
description	BACKGROUND: Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. However, it has also been observed that integrating more data sources does not always increase the prediction quality. Therefore, selecting data sources that highly contribute to the protein function prediction has become an important issue. RESULTS: We present systematic feature selection methods that assess the contribution of genome-wide data sets to predict protein functions and then investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including: protein-domains, protein-protein interactions, gene expressions, phenotype ontology, phylogenetic profiles and disease data sources to predict protein functions that are labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using a kernel based logistic regression (KLR), and a kernel based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set for each function based on its prediction quality. In the second approach, we use the estimated coefficients of features as measures of contribution of data sources. Our results show that the proposed methods improve the prediction quality compared to the full integration of all data sources and other filter-based feature selection methods. We also show that contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among a group of protein functions that have the same parent in the GO hierarchy. CONCLUSIONS: In contrast to previous integration methods, our approaches not only increase the prediction quality but also gather information about highly contributing data sources for each protein function. This information can help researchers collect relevant data sources for annotating protein functions.
format	Text
id	pubmed-2813249
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2813249 2010-01-29 Integrative approaches to the prediction of protein functions based on the feature selection Ko, Seokha Lee, Hyunju BMC Bioinformatics Methodology article BioMed Central 2009-12-31 /pmc/articles/PMC2813249/ /pubmed/20043848 http://dx.doi.org/10.1186/1471-2105-10-455 Text en Copyright ©2009 Ko and Lee; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology article Ko, Seokha Lee, Hyunju Integrative approaches to the prediction of protein functions based on the feature selection
title	Integrative approaches to the prediction of protein functions based on the feature selection
topic	Methodology article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813249/ https://www.ncbi.nlm.nih.gov/pubmed/20043848 http://dx.doi.org/10.1186/1471-2105-10-455
work_keys_str_mv	AT koseokha integrativeapproachestothepredictionofproteinfunctionsbasedonthefeatureselection AT leehyunju integrativeapproachestothepredictionofproteinfunctionsbasedonthefeatureselection

Similar Items