
MS-kNN: protein function prediction by integrating multiple data sources

BACKGROUND: Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional proper...


Bibliographic Details
Main authors:	Lan, Liang, Djuric, Nemanja, Guo, Yuhong, Vucetic, Slobodan
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3584913/ https://www.ncbi.nlm.nih.gov/pubmed/23514608 http://dx.doi.org/10.1186/1471-2105-14-S3-S8

author	Lan, Liang Djuric, Nemanja Guo, Yuhong Vucetic, Slobodan
collection	PubMed
description	BACKGROUND: Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use known information about the sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds the k nearest neighbors of a query protein under several different similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expression. RESULTS: We report the results in the context of the 2011 Critical Assessment of Function Annotation (CAFA). Prior to the CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-kNN achieved a term-based Area Under the Curve (AUC) of 0.728 for Gene Ontology (GO) molecular function prediction when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. Aggregating predictions from all three sources further improved the AUC to 0.848. Similar results were observed for the prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that the overall accuracy of MS-kNN was higher than that of the baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented in all 3 data sources, and 66 in two data sources, the difference between 3-source and one-source MS-kNN was rather small.
CONCLUSIONS: Our results suggest several insights: (1) the k-nearest neighbor algorithm is an efficient and effective model for protein function prediction; (2) it is beneficial to transfer functions across a wide range of organisms; (3) it is helpful to integrate multiple sources of protein information.
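The multi-source neighbor averaging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ms_knn_scores`, the protein and term labels, the similarity values, and the choice to normalize by the total similarity of all selected neighbors are assumptions made for illustration, since the abstract only states that neighbors are found per similarity measure and their functions are averaged with weights.

```python
from typing import Dict, List, Set


def ms_knn_scores(
    similarities: List[Dict[str, float]],
    annotations: Dict[str, Set[str]],
    k: int = 3,
) -> Dict[str, float]:
    """Score each function term for one query protein.

    similarities: one dict per data source, mapping a training protein ID
        to its similarity with the query (hypothetical values).
    annotations: training protein ID -> set of function terms.
    Normalizing by the summed similarity is an assumption of this sketch.
    """
    terms = sorted(set().union(*annotations.values()))
    numer = {t: 0.0 for t in terms}
    denom = 0.0
    for source in similarities:
        # Pick the k most similar training proteins within this source.
        neighbors = sorted(source.items(), key=lambda kv: kv[1], reverse=True)[:k]
        for protein_id, sim in neighbors:
            denom += sim
            # Each neighbor votes for its terms, weighted by its similarity.
            for term in annotations.get(protein_id, ()):
                numer[term] += sim
    if denom == 0.0:
        return {t: 0.0 for t in terms}
    return {t: numer[t] / denom for t in terms}


# Hypothetical toy data: two sources, four training proteins, three terms.
sequence = {"P1": 0.9, "P2": 0.5, "P3": 0.1}
ppi = {"P2": 0.8, "P4": 0.4}
annotations = {
    "P1": {"term_a"},
    "P2": {"term_a", "term_b"},
    "P3": {"term_c"},
    "P4": {"term_b"},
}
scores = ms_knn_scores([sequence, ppi], annotations, k=2)
for term, score in scores.items():
    print(term, round(score, 3))
```

In this toy case a protein missing from one source (P1 has no interaction data) simply contributes through the sources where it does appear, which is one plausible way to handle proteins not represented in all data sources.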
format	Online Article Text
id	pubmed-3584913
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3584913 2013-03-11 MS-kNN: protein function prediction by integrating multiple data sources Lan, Liang Djuric, Nemanja Guo, Yuhong Vucetic, Slobodan BMC Bioinformatics Proceedings BioMed Central 2013-02-28 /pmc/articles/PMC3584913/ /pubmed/23514608 http://dx.doi.org/10.1186/1471-2105-14-S3-S8 Text en Copyright ©2013 Lan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	MS-kNN: protein function prediction by integrating multiple data sources
topic	Proceedings

Similar Items