Cargando…

An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data

Outlier detection is an important task in the field of data mining and a highly active area of research in machine learning. In industrial automation, datasets are often high-dimensional, meaning an effort to study all dimensions directly leads to data sparsity, thus causing outliers to be masked by...

Descripción completa

Detalles Bibliográficos
Autores principales: Li, Zihao, Zhang, Liumei
Formato: Online Artículo Texto
Lenguaje:English
Publicado: MDPI 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10453693/
https://www.ncbi.nlm.nih.gov/pubmed/37628215
http://dx.doi.org/10.3390/e25081185
_version_ 1785095999529680896
author Li, Zihao
Zhang, Liumei
author_facet Li, Zihao
Zhang, Liumei
author_sort Li, Zihao
collection PubMed
description Outlier detection is an important task in the field of data mining and a highly active area of research in machine learning. In industrial automation, datasets are often high-dimensional, meaning an effort to study all dimensions directly leads to data sparsity, thus causing outliers to be masked by noise effects in high-dimensional spaces. The “curse of dimensionality” phenomenon renders many conventional outlier detection methods ineffective. This paper proposes a new outlier detection algorithm called EOEH (Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data). First, random secondary subsampling is performed on the data, and detectors are run on various small-scale sub-samples to provide diverse detection results. Results are then aggregated to reduce the global variance and enhance the robustness of the algorithm. Subsequently, information entropy is utilized to construct a dimension-space weighting method that can discern the influential factors within different dimensional spaces. This method generates weighted subspaces and dimensions for data objects, reducing the impact of noise created by high-dimensional data and improving high-dimensional data detection performance. Finally, this study offers a design for a new high-precision local outlier factor (HPLOF) detector that amplifies the differentiation between normal and outlier data, thereby improving the detection performance of the algorithm. The feasibility of this algorithm is validated through experiments that used both simulated and UCI datasets. In comparison to popular outlier detection algorithms, our algorithm demonstrates a superior detection performance and runtime efficiency. Compared with the current popular, common algorithms, the EOEH algorithm improves the detection performance by 6% on average. In terms of running time for high-dimensional data, EOEH is 20% faster than the current popular algorithms.
format Online
Article
Text
id pubmed-10453693
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-104536932023-08-26 An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data Li, Zihao Zhang, Liumei Entropy (Basel) Article Outlier detection is an important task in the field of data mining and a highly active area of research in machine learning. In industrial automation, datasets are often high-dimensional, meaning an effort to study all dimensions directly leads to data sparsity, thus causing outliers to be masked by noise effects in high-dimensional spaces. The “curse of dimensionality” phenomenon renders many conventional outlier detection methods ineffective. This paper proposes a new outlier detection algorithm called EOEH (Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data). First, random secondary subsampling is performed on the data, and detectors are run on various small-scale sub-samples to provide diverse detection results. Results are then aggregated to reduce the global variance and enhance the robustness of the algorithm. Subsequently, information entropy is utilized to construct a dimension-space weighting method that can discern the influential factors within different dimensional spaces. This method generates weighted subspaces and dimensions for data objects, reducing the impact of noise created by high-dimensional data and improving high-dimensional data detection performance. Finally, this study offers a design for a new high-precision local outlier factor (HPLOF) detector that amplifies the differentiation between normal and outlier data, thereby improving the detection performance of the algorithm. The feasibility of this algorithm is validated through experiments that used both simulated and UCI datasets. In comparison to popular outlier detection algorithms, our algorithm demonstrates a superior detection performance and runtime efficiency. Compared with the current popular, common algorithms, the EOEH algorithm improves the detection performance by 6% on average. In terms of running time for high-dimensional data, EOEH is 20% faster than the current popular algorithms. MDPI 2023-08-09 /pmc/articles/PMC10453693/ /pubmed/37628215 http://dx.doi.org/10.3390/e25081185 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Li, Zihao
Zhang, Liumei
An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data
title An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data
title_full An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data
title_fullStr An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data
title_full_unstemmed An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data
title_short An Ensemble Outlier Detection Method Based on Information Entropy-Weighted Subspaces for High-Dimensional Data
title_sort ensemble outlier detection method based on information entropy-weighted subspaces for high-dimensional data
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10453693/
https://www.ncbi.nlm.nih.gov/pubmed/37628215
http://dx.doi.org/10.3390/e25081185
work_keys_str_mv AT lizihao anensembleoutlierdetectionmethodbasedoninformationentropyweightedsubspacesforhighdimensionaldata
AT zhangliumei anensembleoutlierdetectionmethodbasedoninformationentropyweightedsubspacesforhighdimensionaldata
AT lizihao ensembleoutlierdetectionmethodbasedoninformationentropyweightedsubspacesforhighdimensionaldata
AT zhangliumei ensembleoutlierdetectionmethodbasedoninformationentropyweightedsubspacesforhighdimensionaldata