
Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality


Bibliographic Details
Main Authors: Shetta, Omar; Niranjan, Mahesan
Format: Online Article Text
Language: English
Published: The Royal Society 2020
Subjects: Computer Science and Artificial Intelligence
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7062061/
https://www.ncbi.nlm.nih.gov/pubmed/32257299
http://dx.doi.org/10.1098/rsos.190714
collection PubMed
description The application of machine learning to inference problems in biology is dominated by supervised learning problems of regression and classification, and unsupervised learning problems of clustering and variants of low-dimensional projections for visualization. A class of problems that have not gained much attention is detecting outliers in datasets, arising from reasons such as gross experimental, reporting or labelling errors. These could also be small parts of a dataset that are functionally distinct from the majority of a population. Outlier data are often identified by considering the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, which is a serious problem with omics data which are often found in very high dimensions. We develop an outlier detection method based on structured low-rank approximation methods. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method has better clustering and visualization performance on the recovered low-dimensional projection when compared with popular dimensionality reduction techniques.
format Online
Article
Text
id pubmed-7062061
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher The Royal Society
record_format MEDLINE/PubMed
spelling pubmed-7062061 2020-03-31 R Soc Open Sci. The Royal Society, 2020-02-05. Text en. © 2020 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
topic Computer Science and Artificial Intelligence