
A kernel-based approach for detecting outliers of high-dimensional biological data

BACKGROUND: In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information. RESULTS: We propose a new outlier detection method based on Kullback...


Bibliographic Details
Main Authors: Oh, Jung Hun, Gao, Jean
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2681063/
https://www.ncbi.nlm.nih.gov/pubmed/19426455
http://dx.doi.org/10.1186/1471-2105-10-S4-S7
_version_ 1782167008733822976
author Oh, Jung Hun
Gao, Jean
author_facet Oh, Jung Hun
Gao, Jean
author_sort Oh, Jung Hun
collection PubMed
description BACKGROUND: In many cases, biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information. RESULTS: We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. The original concept of KL divergence was designed as a measure of distance between two distributions. Stemming from that, we extend it to biological sample outlier detection by forming sample sets composed of nearest neighbors. KL divergence is defined between two sample sets, with and without the test sample. To handle the non-linearity of the sample distribution, the original data are mapped into a higher-dimensional feature space. We address the singularity problem due to small sample size during KL divergence calculation. Kernel functions are applied to avoid direct use of mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set from a liver cancer study. Comparative studies with a Mahalanobis distance-based method and a one-class support vector machine (SVM) are reported, showing that the proposed method performs better in finding outliers. CONCLUSION: Our idea was derived from the Markov blanket algorithm, which is a feature selection method based on KL divergence. That is, while the Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, our proposed method shows better or comparable performance for small-sample, high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets.
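The idea in the abstract above (score a test sample by the KL divergence between its nearest-neighbor set with and without the sample) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a Gaussian fit to each sample set in the original input space, and it swaps the paper's kernel mapping and singularity handling for a plain ridge term added to the covariance. The function names (`gaussian_kl`, `fit_gaussian`, `kl_outlier_scores`) and the parameters `k` and `ridge` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL(N(mu0, cov0) || N(mu1, cov1)) for multivariate Gaussians."""
    d = mu0.shape[0]
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - d + logdet1 - logdet0)

def fit_gaussian(X, ridge):
    """Mean and ridge-regularized covariance of the rows of X.

    The ridge term is an assumption of this sketch: it keeps the covariance
    invertible when there are fewer samples than dimensions.
    """
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True) + ridge * np.eye(X.shape[1])
    return mu, cov

def kl_outlier_scores(X, k=5, ridge=1e-3):
    """Score each row of X by KL(neighbors + sample || neighbors only)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                    # a sample is not its own neighbor
    scores = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]                # k nearest neighbors of sample i
        without = X[nbrs]
        with_i = np.vstack([without, X[i]])
        scores[i] = gaussian_kl(*fit_gaussian(with_i, ridge),
                                *fit_gaussian(without, ridge))
    return scores

# Toy data: a 5x5 grid of inliers plus one planted far-away point (row 25).
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
X = np.vstack([grid, [20.0, 20.0]])
scores = kl_outlier_scores(X, k=5)
print(int(np.argmax(scores)), float(scores.max()))
```

A larger score means that adding the sample changes the fitted distribution of its neighborhood more, so the highest-scoring rows are the outlier candidates. For data with far more features than samples, the ridge value would need tuning, and the kernel formulation described in the abstract is the approach the paper itself takes.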
format Text
id pubmed-2681063
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2681063 2009-05-13 A kernel-based approach for detecting outliers of high-dimensional biological data Oh, Jung Hun Gao, Jean BMC Bioinformatics Proceedings
BioMed Central 2009-04-29 /pmc/articles/PMC2681063/ /pubmed/19426455 http://dx.doi.org/10.1186/1471-2105-10-S4-S7 Text en Copyright © 2009 Oh and Gao; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Proceedings
Oh, Jung Hun
Gao, Jean
A kernel-based approach for detecting outliers of high-dimensional biological data
title A kernel-based approach for detecting outliers of high-dimensional biological data
title_full A kernel-based approach for detecting outliers of high-dimensional biological data
title_fullStr A kernel-based approach for detecting outliers of high-dimensional biological data
title_full_unstemmed A kernel-based approach for detecting outliers of high-dimensional biological data
title_short A kernel-based approach for detecting outliers of high-dimensional biological data
title_sort kernel-based approach for detecting outliers of high-dimensional biological data
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2681063/
https://www.ncbi.nlm.nih.gov/pubmed/19426455
http://dx.doi.org/10.1186/1471-2105-10-S4-S7
work_keys_str_mv AT ohjunghun akernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata
AT gaojean akernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata
AT ohjunghun kernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata
AT gaojean kernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata