Cargando…

Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach †

To design an algorithm for detecting outliers over streaming data has become an important task in many common applications, arising in areas such as fraud detections, network analysis, environment monitoring and so forth. Due to the fact that real-time data may arrive in the form of streams rather t...

Descripción completa

Detalles Bibliográficos
Autores principales: Yu, Kangqing, Shi, Wei, Santoro, Nicola
Formato: Online Artículo Texto
Lenguaje:English
Publicado: MDPI 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7085525/
https://www.ncbi.nlm.nih.gov/pubmed/32110907
http://dx.doi.org/10.3390/s20051261
_version_ 1783508951673864192
author Yu, Kangqing
Shi, Wei
Santoro, Nicola
author_facet Yu, Kangqing
Shi, Wei
Santoro, Nicola
author_sort Yu, Kangqing
collection PubMed
description To design an algorithm for detecting outliers over streaming data has become an important task in many common applications, arising in areas such as fraud detections, network analysis, environment monitoring and so forth. Due to the fact that real-time data may arrive in the form of streams rather than batches, properties such as concept drift, temporal context, transiency, and uncertainty need to be considered. In addition, data processing needs to be incremental with limited memory resource, and scalable. These facts create big challenges for existing outlier detection algorithms in terms of their accuracies when they are implemented in an incremental fashion, especially in the streaming environment. To address these problems, we first propose C_KDE_WR, which uses sliding window and kernel function to process the streaming data online, and reports its results demonstrating high throughput on handling real-time streaming data, implemented in a CUDA framework on Graphics Processing Unit (GPU). We also present another algorithm, C_LOF, based on a very popular and effective outlier detection algorithm called Local Outlier Factor (LOF) which unfortunately works only on batched data. Using a novel incremental approach that compensates the drawback of high complexity in LOF, we show how to implement it in a streaming context and to obtain results in a timely manner. Like C_KDE_WR, C_LOF also employs sliding-window and statistical-summary to help making decision based on the data in the current window. It also addresses all those challenges of streaming data as addressed in C_KDE_WR. In addition, we report the comparative evaluation on the accuracy of C_KDE_WR with the state-of-the-art SOD_GPU using Precision, Recall and F-score metrics. Furthermore, a t-test is also performed to demonstrate the significance of the improvement. We further report the testing results of C_LOF on different parameter settings and drew ROC and PR curve with their area under the curve (AUC) and Average Precision (AP) values calculated respectively. Experimental results show that C_LOF can overcome the masquerading problem, which often exists in outlier detection on streaming data. We provide complexity analysis and report experiment results on the accuracy of both C_KDE_WR and C_LOF algorithms in order to evaluate their effectiveness as well as their efficiencies.
format Online
Article
Text
id pubmed-7085525
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-70855252020-03-23 Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach † Yu, Kangqing Shi, Wei Santoro, Nicola Sensors (Basel) Article To design an algorithm for detecting outliers over streaming data has become an important task in many common applications, arising in areas such as fraud detections, network analysis, environment monitoring and so forth. Due to the fact that real-time data may arrive in the form of streams rather than batches, properties such as concept drift, temporal context, transiency, and uncertainty need to be considered. In addition, data processing needs to be incremental with limited memory resource, and scalable. These facts create big challenges for existing outlier detection algorithms in terms of their accuracies when they are implemented in an incremental fashion, especially in the streaming environment. To address these problems, we first propose C_KDE_WR, which uses sliding window and kernel function to process the streaming data online, and reports its results demonstrating high throughput on handling real-time streaming data, implemented in a CUDA framework on Graphics Processing Unit (GPU). We also present another algorithm, C_LOF, based on a very popular and effective outlier detection algorithm called Local Outlier Factor (LOF) which unfortunately works only on batched data. Using a novel incremental approach that compensates the drawback of high complexity in LOF, we show how to implement it in a streaming context and to obtain results in a timely manner. Like C_KDE_WR, C_LOF also employs sliding-window and statistical-summary to help making decision based on the data in the current window. It also addresses all those challenges of streaming data as addressed in C_KDE_WR. In addition, we report the comparative evaluation on the accuracy of C_KDE_WR with the state-of-the-art SOD_GPU using Precision, Recall and F-score metrics. Furthermore, a t-test is also performed to demonstrate the significance of the improvement. We further report the testing results of C_LOF on different parameter settings and drew ROC and PR curve with their area under the curve (AUC) and Average Precision (AP) values calculated respectively. Experimental results show that C_LOF can overcome the masquerading problem, which often exists in outlier detection on streaming data. We provide complexity analysis and report experiment results on the accuracy of both C_KDE_WR and C_LOF algorithms in order to evaluate their effectiveness as well as their efficiencies. MDPI 2020-02-26 /pmc/articles/PMC7085525/ /pubmed/32110907 http://dx.doi.org/10.3390/s20051261 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Yu, Kangqing
Shi, Wei
Santoro, Nicola
Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach †
title Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach †
title_full Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach †
title_fullStr Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach †
title_full_unstemmed Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach †
title_short Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach †
title_sort designing a streaming algorithm for outlier detection in data mining—an incremental approach †
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7085525/
https://www.ncbi.nlm.nih.gov/pubmed/32110907
http://dx.doi.org/10.3390/s20051261
work_keys_str_mv AT yukangqing designingastreamingalgorithmforoutlierdetectionindatamininganincrementalapproach
AT shiwei designingastreamingalgorithmforoutlierdetectionindatamininganincrementalapproach
AT santoronicola designingastreamingalgorithmforoutlierdetectionindatamininganincrementalapproach