A kernel-based approach for detecting outliers of high-dimensional biological data
BACKGROUND: In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information. RESULTS: We propose a new outlier detection method based on Kullback...
Main authors: | Oh, Jung Hun; Gao, Jean |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2009 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2681063/ https://www.ncbi.nlm.nih.gov/pubmed/19426455 http://dx.doi.org/10.1186/1471-2105-10-S4-S7 |
_version_ | 1782167008733822976 |
---|---|
author | Oh, Jung Hun Gao, Jean |
author_facet | Oh, Jung Hun Gao, Jean |
author_sort | Oh, Jung Hun |
collection | PubMed |
description | BACKGROUND: In many cases biomedical data sets contain outliers that make it difficult to achieve reliable knowledge discovery. Data analysis without removing outliers could lead to wrong results and provide misleading information. RESULTS: We propose a new outlier detection method based on Kullback-Leibler (KL) divergence. KL divergence was originally designed as a measure of distance between two distributions. We extend it to outlier detection in biological samples by forming sample sets composed of nearest neighbors. KL divergence is defined between two sample sets, one with and one without the test sample. To handle the non-linearity of the sample distribution, the original data are mapped into a higher-dimensional feature space. We address the singularity problem that arises from the small sample size during KL divergence calculation. Kernel functions are applied to avoid direct use of mapping functions. The performance of the proposed method is demonstrated on a synthetic data set, two public microarray data sets, and a mass spectrometry data set from a liver cancer study. Comparative studies with a Mahalanobis distance-based method and a one-class support vector machine (SVM) are reported, showing that the proposed method performs better in finding outliers. CONCLUSION: Our idea was derived from the Markov blanket algorithm, a feature selection method based on KL divergence. That is, while the Markov blanket algorithm removes redundant and irrelevant features, our proposed method detects outliers. Compared to other algorithms, our proposed method shows better or comparable performance for small-sample, high-dimensional biological data. This indicates that the proposed method can be used to detect outliers in biological data sets. |
format | Text |
id | pubmed-2681063 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2681063 2009-05-13 A kernel-based approach for detecting outliers of high-dimensional biological data Oh, Jung Hun Gao, Jean BMC Bioinformatics Proceedings BioMed Central 2009-04-29 /pmc/articles/PMC2681063/ /pubmed/19426455 http://dx.doi.org/10.1186/1471-2105-10-S4-S7 Text en Copyright © 2009 Oh and Gao; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Proceedings Oh, Jung Hun Gao, Jean A kernel-based approach for detecting outliers of high-dimensional biological data |
title | A kernel-based approach for detecting outliers of high-dimensional biological data |
title_full | A kernel-based approach for detecting outliers of high-dimensional biological data |
title_fullStr | A kernel-based approach for detecting outliers of high-dimensional biological data |
title_full_unstemmed | A kernel-based approach for detecting outliers of high-dimensional biological data |
title_short | A kernel-based approach for detecting outliers of high-dimensional biological data |
title_sort | kernel-based approach for detecting outliers of high-dimensional biological data |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2681063/ https://www.ncbi.nlm.nih.gov/pubmed/19426455 http://dx.doi.org/10.1186/1471-2105-10-S4-S7 |
work_keys_str_mv | AT ohjunghun akernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata AT gaojean akernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata AT ohjunghun kernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata AT gaojean kernelbasedapproachfordetectingoutliersofhighdimensionalbiologicaldata |
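
The description field summarizes the method only at a high level: a KL divergence is computed between two nearest-neighbor sample sets, one with and one without the test sample. The following Python sketch illustrates that idea under strong simplifying assumptions and is not the authors' implementation. It fits a Gaussian to each sample set in the original input space (the kernel mapping is omitted), adds a ridge term to the covariance as one possible way to sidestep singular matrices, and picks the direction KL(with || without) arbitrarily. The function names and the parameters `k` and `ridge` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL(N(mu0, cov0) || N(mu1, cov1)) for multivariate Gaussians."""
    d = mu0.shape[0]
    diff = mu1 - mu0
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    maha_term = diff @ np.linalg.solve(cov1, diff)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + maha_term - d + logdet1 - logdet0)

def kl_outlier_scores(X, k=10, ridge=0.1):
    """Score each row of X by the KL divergence between Gaussians fitted to
    its k nearest neighbors with and without the row itself.
    Assumption: ridge * I regularizes the covariance (not the paper's fix)."""
    n, d = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        order = np.argsort(dists[i])
        nbrs = order[order != i][:k]          # k nearest neighbors, excluding i
        without = X[nbrs]
        with_i = np.vstack([without, X[i]])
        cov_wo = np.atleast_2d(np.cov(without, rowvar=False)) + ridge * np.eye(d)
        cov_w = np.atleast_2d(np.cov(with_i, rowvar=False)) + ridge * np.eye(d)
        scores[i] = gaussian_kl(with_i.mean(axis=0), cov_w,
                                without.mean(axis=0), cov_wo)
    return scores

# Toy data (made up for illustration): 50 inliers plus one distant point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])
scores = kl_outlier_scores(X)
print(int(np.argmax(scores)))  # index of the highest-scoring sample
```

A sample whose inclusion barely changes the fitted distribution of its neighborhood receives a score near zero, while a sample far from its neighbors shifts the mean and inflates the covariance, which raises the divergence.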