A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data

BACKGROUND: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources. METHODS: In this paper, we address this issue by proposing a general framework for gene function pred...

Bibliographic Details
Main Authors:	Yao, Zizhen; Ruzzo, Walter L
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810312/ https://www.ncbi.nlm.nih.gov/pubmed/16723004 http://dx.doi.org/10.1186/1471-2105-7-S1-S11

_version_	1782132575686361088
author	Yao, Zizhen Ruzzo, Walter L
author_facet	Yao, Zizhen Ruzzo, Walter L
author_sort	Yao, Zizhen
collection	PubMed
description	BACKGROUND: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources. METHODS: In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, its flexibility to incorporate different data types, and its adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance depends on the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method gracefully extends to multi-way classification problems. RESULTS: We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly. CONCLUSION: Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.
format	Text
id	pubmed-1810312
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1810312 2007-03-14 A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data Yao, Zizhen Ruzzo, Walter L BMC Bioinformatics Proceedings BioMed Central 2006-03-20 /pmc/articles/PMC1810312/ /pubmed/16723004 http://dx.doi.org/10.1186/1471-2105-7-S1-S11 Text en
spellingShingle	Proceedings Yao, Zizhen Ruzzo, Walter L A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data
title	A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data
title_sort	regression-based k nearest neighbor algorithm for gene function prediction from heterogeneous data
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810312/ https://www.ncbi.nlm.nih.gov/pubmed/16723004 http://dx.doi.org/10.1186/1471-2105-7-S1-S11
work_keys_str_mv	AT yaozizhen aregressionbasedknearestneighboralgorithmforgenefunctionpredictionfromheterogeneousdata AT ruzzowalterl aregressionbasedknearestneighboralgorithmforgenefunctionpredictionfromheterogeneousdata AT yaozizhen regressionbasedknearestneighboralgorithmforgenefunctionpredictionfromheterogeneousdata AT ruzzowalterl regressionbasedknearestneighboralgorithmforgenefunctionpredictionfromheterogeneousdata
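
The description field above outlines the method only at a high level: regression infers a similarity metric as a weighted combination of base similarity measures, and neighbors found with that metric vote on the class. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes logistic regression (the record only says "regression methods"), uses a plain similarity-weighted vote as the confidence score rather than the paper's own voting scheme, and runs on made-up toy data. All function names (`fit_similarity_weights`, `predict`, `inv_dist`) are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_similarity_weights(items, labels, base_sims, epochs=500, lr=0.5):
    """Fit logistic-regression weights over base similarities (an assumption).

    Each training pair (i, j) is one example: its features are the base
    similarity values, and its target is 1 when both items share a label.
    """
    pairs = list(combinations(range(len(items)), 2))
    feats = [[s(items[i], items[j]) for s in base_sims] for i, j in pairs]
    targets = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    weights, bias = [0.0] * len(base_sims), 0.0
    n = len(pairs)
    for _ in range(epochs):
        # Batch gradient of the mean log-loss.
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(feats, targets):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for k, v in enumerate(x):
                grad_w[k] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(query, items, labels, base_sims, weights, bias, k=3):
    """Return (label, confidence) from a similarity-weighted vote of k neighbors."""
    scored = []
    for item, label in zip(items, labels):
        z = sum(w * s(query, item) for w, s in zip(weights, base_sims)) + bias
        scored.append((sigmoid(z), label))  # learned combined similarity
    scored.sort(key=lambda t: t[0], reverse=True)
    votes = defaultdict(float)
    for sim, label in scored[:k]:
        votes[label] += sim
    best = max(votes, key=votes.get)
    # Confidence here is simply the winner's share of the vote mass.
    return best, votes[best] / sum(votes.values())


def inv_dist(index):
    # Hypothetical base similarity: 1 / (1 + |difference|) on one feature.
    return lambda a, b: 1.0 / (1.0 + abs(a[index] - b[index]))


# Toy data: feature 0 separates the classes, feature 1 is noise.
items = [(0.0, 5.0), (0.2, 1.0), (0.1, 9.0), (5.0, 5.2), (5.2, 0.8), (5.1, 9.1)]
labels = ["A", "A", "A", "B", "B", "B"]
base_sims = [inv_dist(0), inv_dist(1)]

weights, bias = fit_similarity_weights(items, labels, base_sims)
label, confidence = predict((0.15, 3.0), items, labels, base_sims, weights, bias, k=5)
print(weights, label, confidence)
```

In this toy setup the regression should give the informative base similarity a larger weight than the noisy one, which is the effect the description attributes to learning the metric instead of choosing it ad hoc.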
