Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure
Anomaly Detection in the CERN Data Center is a challenging task due to the large scale of the computing infrastructure and the large volume of data to monitor. At CERN, the current solution to spot anomalous server machines in the computing infrastructure relies on threshold-based alarming systems c...
Main author: | Paltenghi, Matteo |
---|---|
Language: | eng |
Published: | Politecnico di Milano, 2020 |
Subjects: | Computing and Computers |
Online access: | http://cds.cern.ch/record/2752641 |
_version_ | 1780969294718304256 |
---|---|
author | Paltenghi, Matteo |
author_facet | Paltenghi, Matteo |
author_sort | Paltenghi, Matteo |
collection | CERN |
description | Anomaly detection in the CERN Data Center is a challenging task due to the large scale of the computing infrastructure and the large volume of data to monitor. At CERN, the current solution for spotting anomalous server machines in the computing infrastructure relies on threshold-based alarming systems, carefully set by the system managers on the time series of performance metrics of each infrastructure component. The goal of this work is to relieve the burden of this complex task and explore fully automated machine learning solutions based on anomaly detection. Moreover, in most real industrial scenarios, labeled data to train supervised machine learning methods are unavailable because they are costly or difficult to collect. Therefore, our focus is on fully unsupervised anomaly detection methods, and we explore the current state of the art, including both traditional anomaly detection methods and recent successful deep anomaly detection approaches. In this work, we propose novel formulations of time-series-specific approaches (CNN Forecaster, VAR Forecaster) and adaptations that reuse traditional machine learning methods (LOF, OCSVM, IFOREST, KNN, PCA) and deep learning ones (Autoencoder Fully Connected, CNN Autoencoder, LSTM Autoencoder) with time series data. In addition, we explore six ensemble strategies to combine the strengths of the individual algorithms. We then present a comparative study of these 10 individual methods and 6 ensemble strategies on the CERN use case to identify the best approach for the specific problem characteristics of the CERN large-scale computing infrastructure. In addition, given the absence of labeled data, we put in place an annotation system that enables efficient annotation of time series. We use it to collect and create two new time series datasets for anomaly detection, which represent two different CERN user categories. The results of this study, in terms of ROC-AUC detection performance and training time, make a strong case for the traditional methods, which work extremely well for the specific problem at hand; on the other hand, we also observe that they tend to be outperformed by deep methods whenever the time series patterns of normal instances become more complex. In parallel with our comparative evaluation study, we also publish an open source proof-of-concept anomaly detection system. |
id | cern-2752641 |
institution | European Organization for Nuclear Research |
language | eng |
publishDate | 2020 |
publisher | Politecnico di Milano |
record_format | invenio |
spelling | cern-2752641; 2022-01-31T15:02:04Z; http://cds.cern.ch/record/2752641; eng; Paltenghi, Matteo; Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure; Computing and Computers; Politecnico di Milano; CERN-THESIS-2020-282; oai:cds.cern.ch:2752641; 2020-10-02 |
spellingShingle | Computing and Computers Paltenghi, Matteo Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure |
title | Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure |
title_full | Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure |
title_fullStr | Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure |
title_full_unstemmed | Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure |
title_short | Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure |
title_sort | time series anomaly detection for cern large-scale computing infrastructure |
topic | Computing and Computers |
url | http://cds.cern.ch/record/2752641 |
work_keys_str_mv | AT paltenghimatteo timeseriesanomalydetectionforcernlargescalecomputinginfrastructure |
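
The description mentions adapting traditional methods such as KNN to time series data and comparing detectors by ROC-AUC. The sketch below illustrates one common way this can be done: cut the series into sliding windows, score each window by its distance to the nearest windows of an anomaly-free reference series, and compute ROC-AUC against known labels. It is not the thesis implementation. The synthetic signal, the window width, the value of `k`, and all function names are assumptions chosen for illustration only.

```python
import math
import random


def sliding_windows(series, width):
    """Return every contiguous window of `width` samples as a list."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def knn_scores(train, test, k):
    """Score each test window by its mean Euclidean distance
    to its k nearest training windows (higher = more anomalous)."""
    scores = []
    for window in test:
        dists = sorted(math.dist(window, ref) for ref in train)
        scores.append(sum(dists[:k]) / k)
    return scores


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (anomalous, normal) pairs in which
    the anomalous item gets the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def metric(n):
    """Hypothetical stand-in for one performance metric:
    a 24-sample periodic pattern plus small Gaussian noise."""
    return [math.sin(2 * math.pi * t / 24) + random.gauss(0, 0.05)
            for t in range(n)]


random.seed(0)
WIDTH = 12                       # assumed window width
SPIKE = range(100, 110)          # assumed position of the injected anomaly

train_series = metric(240)       # reference series, assumed anomaly-free
test_series = metric(240)
for t in SPIKE:
    test_series[t] += 3.0        # inject a level shift to act as the anomaly

train = sliding_windows(train_series, WIDTH)
test = sliding_windows(test_series, WIDTH)

# A test window is labeled anomalous if it overlaps the injected spike.
labels = [int(any(i + j in SPIKE for j in range(WIDTH)))
          for i in range(len(test))]

scores = knn_scores(train, test, k=5)
auc = roc_auc(labels, scores)
print(f"ROC-AUC on the synthetic series: {auc:.3f}")
```

The same windowing step can feed any of the other detectors listed in the description; only the scoring function changes. The labels here exist only because the anomaly is injected by hand, which is what allows ROC-AUC to be computed in this toy setting.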