Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure

Anomaly Detection in the CERN Data Center is a challenging task due to the large scale of the computing infrastructure and the large volume of data to monitor. At CERN, the current solution to spot anomalous server machines in the computing infrastructure relies on threshold-based alarming systems c...

Bibliographic Details
Main author: Paltenghi, Matteo
Language: eng
Published: Politecnico di Milano 2020
Subjects: Computing and Computers
Online access: http://cds.cern.ch/record/2752641
_version_ 1780969294718304256
author Paltenghi, Matteo
author_facet Paltenghi, Matteo
author_sort Paltenghi, Matteo
collection CERN
description Anomaly detection in the CERN Data Center is a challenging task because of the large scale of the computing infrastructure and the large volume of data to monitor. At CERN, the current solution for spotting anomalous server machines relies on threshold-based alarming systems that system managers carefully set on the time series of performance metrics of each infrastructure component. The goal of this work is to relieve the burden of this complex task and to explore fully automated machine learning solutions based on anomaly detection. Moreover, in most real industrial scenarios, labeled data for training supervised machine learning methods are unavailable because they are costly or difficult to collect. Our focus is therefore on fully unsupervised anomaly detection methods, and we explore the current state of the art, including both traditional anomaly detection methods and recent successful deep anomaly detection approaches. In this work, we propose novel formulations of time-series-specific approaches (CNN Forecaster, VAR Forecaster) and adaptations that reuse traditional machine learning methods (LOF, OCSVM, IFOREST, KNN, PCA) and deep learning ones (Autoencoder Fully Connected, CNN Autoencoder, LSTM Autoencoder) with time series data. In addition, we explore six ensemble strategies to combine the strengths of the individual algorithms. We then present a comparative study of these 10 individual methods and 6 ensemble strategies on the CERN use case, to identify the best approach for the specific problem characteristics of the CERN large-scale computing infrastructure. Furthermore, given the absence of labeled data, we put in place an annotation system that makes it possible to annotate time series efficiently. We use it to collect and create two new time series datasets for anomaly detection, which represent two different CERN user categories. The results of this study, in terms of ROC-AUC detection performance and training time, make a strong case for the traditional methods, which work extremely well for the specific problem at hand; on the other hand, we also observe that they tend to be outperformed by deep methods whenever the time series patterns of normal instances become more complex. In parallel with our comparative evaluation study, we also publish an open source proof-of-concept anomaly detection system.
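The abstract does not say how the traditional methods are adapted to time series, so the following is only a minimal sketch under stated assumptions: it assumes a sliding-window embedding (each window of consecutive points becomes one feature vector), uses a plain k-nearest-neighbour distance as the anomaly score, and evaluates with ROC-AUC on synthetic data. The function names, the window width, and the synthetic signal are all hypothetical and are not taken from the thesis.

```python
import math
import random


def sliding_windows(series, width):
    """Turn a series into overlapping windows, one feature vector per window."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def knn_scores(train, test, k=5):
    """Anomaly score = mean Euclidean distance to the k nearest training windows."""
    scores = []
    for window in test:
        dists = sorted(math.dist(window, ref) for ref in train)
        scores.append(sum(dists[:k]) / k)
    return scores


def roc_auc(labels, scores):
    """ROC-AUC as the probability that an anomaly outscores a normal window."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def synthetic_series(n, rng, spikes=()):
    """Hypothetical metric: a daily sine pattern plus noise, with optional spikes."""
    values = [math.sin(2 * math.pi * t / 24) + rng.gauss(0, 0.1) for t in range(n)]
    for t in spikes:
        values[t] += 3.0
    return values


rng = random.Random(0)
width = 12            # assumed window length, not a value from the thesis
spikes = [60, 150]    # injected anomalies, used only to build evaluation labels

# Training data contains only normal behaviour (unsupervised setting).
train = sliding_windows(synthetic_series(240, rng), width)
test = sliding_windows(synthetic_series(240, rng, spikes), width)

# A test window is labeled anomalous if it covers an injected spike.
labels = [int(any(i <= s < i + width for s in spikes)) for i in range(len(test))]

scores = knn_scores(train, test)
auc = roc_auc(labels, scores)
print(f"ROC-AUC on synthetic data: {auc:.3f}")
```

The labels are used only for evaluation, never for fitting, which mirrors the unsupervised setting the abstract describes. Any other detector that maps a window to a score could be substituted for `knn_scores` without changing the rest of the sketch.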
id cern-2752641
institution European Organization for Nuclear Research
language eng
publishDate 2020
publisher Politecnico di Milano
record_format invenio
spelling cern-2752641 2022-01-31T15:02:04Z http://cds.cern.ch/record/2752641 eng Paltenghi, Matteo Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure Computing and Computers Politecnico di Milano CERN-THESIS-2020-282 oai:cds.cern.ch:2752641 2020-10-02
spellingShingle Computing and Computers
Paltenghi, Matteo
Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure
title Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure
title_full Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure
title_fullStr Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure
title_full_unstemmed Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure
title_short Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure
title_sort time series anomaly detection for cern large-scale computing infrastructure
topic Computing and Computers
url http://cds.cern.ch/record/2752641
work_keys_str_mv AT paltenghimatteo timeseriesanomalydetectionforcernlargescalecomputinginfrastructure