A kernel-based approach for detecting outliers of high-dimensional biological data

BACKGROUND: In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information. RESULTS: We propose a new outlier detection method based on Kullback...

Bibliographic Details
Main authors:	Oh, Jung Hun; Gao, Jean
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2681063/ https://www.ncbi.nlm.nih.gov/pubmed/19426455 http://dx.doi.org/10.1186/1471-2105-10-S4-S7

_version_	1782167008733822976
author	Oh, Jung Hun; Gao, Jean
author_facet	Oh, Jung Hun; Gao, Jean
author_sort	Oh, Jung Hun
collection	PubMed
description	BACKGROUND: In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information. RESULTS: We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. The original concept of KL divergence was designed as a measure of distance between two distributions. Stemming from that, we extend it to biological sample outlier detection by forming sample sets composed of nearest neighbors. KL divergence is defined between two sample sets with and without the test sample. To handle the non-linearity of sample distribution, original data is mapped into a higher feature space. We address the singularity problem due to small sample size during KL divergence calculation. Kernel functions are applied to avoid direct use of mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set for liver cancer study. Comparative studies with Mahalanobis distance based method and one-class support vector machine (SVM) are reported showing that the proposed method performs better in finding outliers. CONCLUSION: Our idea was derived from Markov blanket algorithm that is a feature selection method based on KL divergence. That is, while Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, our proposed method shows better or comparable performance for small sample and high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets.
format	Text
id	pubmed-2681063
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2681063 2009-05-13 A kernel-based approach for detecting outliers of high-dimensional biological data Oh, Jung Hun; Gao, Jean BMC Bioinformatics Proceedings
BioMed Central 2009-04-29 /pmc/articles/PMC2681063/ /pubmed/19426455 http://dx.doi.org/10.1186/1471-2105-10-S4-S7 Text en Copyright © 2009 Oh and Gao; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Proceedings Oh, Jung Hun Gao, Jean A kernel-based approach for detecting outliers of high-dimensional biological data
title	A kernel-based approach for detecting outliers of high-dimensional biological data
title_full	A kernel-based approach for detecting outliers of high-dimensional biological data
title_fullStr	A kernel-based approach for detecting outliers of high-dimensional biological data
title_full_unstemmed	A kernel-based approach for detecting outliers of high-dimensional biological data
title_short	A kernel-based approach for detecting outliers of high-dimensional biological data
title_sort	kernel-based approach for detecting outliers of high-dimensional biological data
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2681063/ https://www.ncbi.nlm.nih.gov/pubmed/19426455 http://dx.doi.org/10.1186/1471-2105-10-S4-S7
work_keys_str_mv	AT ohjunghun akernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata AT gaojean akernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata AT ohjunghun kernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata AT gaojean kernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata
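
The description field above outlines the core idea: score a test sample by the KL divergence between two nearest-neighbor sample sets, one with and one without that sample. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes Gaussian fits in the original input space (no kernel mapping), and it assumes a ridge term added to each covariance as one simple way to sidestep the small-sample singularity; the record does not say how the paper handles it. The direction KL(with || without), the function names, and the parameters `k` and `ridge` are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for multivariate Gaussians."""
    d = mu0.shape[0]
    diff = mu1 - mu0
    # Solve linear systems rather than forming an explicit inverse.
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - d + logdet1 - logdet0)


def fit_gaussian(samples, ridge):
    """Mean and ridge-regularized covariance of a small sample set."""
    mu = samples.mean(axis=0)
    centered = samples - mu
    cov = centered.T @ centered / samples.shape[0]
    # Assumption: the ridge keeps the covariance invertible when the
    # number of samples is smaller than the number of features.
    cov += ridge * np.eye(samples.shape[1])
    return mu, cov


def kl_outlier_scores(X, k=5, ridge=1.0):
    """Score each row of X; larger scores suggest outliers."""
    n = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    scores = np.empty(n)
    for i in range(n):
        order = np.argsort(sq_dists[i])
        neighbors = order[order != i][:k]          # k nearest, excluding i
        without = X[neighbors]                     # set without the test sample
        with_test = np.vstack([without, X[i]])     # same set plus the test sample
        mu_w, cov_w = fit_gaussian(with_test, ridge)
        mu_o, cov_o = fit_gaussian(without, ridge)
        scores[i] = gaussian_kl(mu_w, cov_w, mu_o, cov_o)
    return scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 50))            # few samples, many features
    X = np.vstack([X, np.full(50, 10.0)])    # one planted outlier (row 30)
    scores = kl_outlier_scores(X)
    print("highest-scoring row:", int(np.argmax(scores)))
```

A sample far from its neighbors shifts the mean and inflates the covariance of the "with" set, which raises its score. A kernelized variant would replace the input-space statistics with quantities computed from a kernel matrix, which this sketch does not attempt.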
