
Time Series Anomaly Detection for CERN Large-Scale Computing Infrastructure


Bibliographic Details
Main author: Paltenghi, Matteo
Language: eng
Published: Politecnico di Milano 2020
Subjects:
Online access: http://cds.cern.ch/record/2752641
Description
Summary: Anomaly detection in the CERN Data Center is a challenging task due to the large scale of the computing infrastructure and the large volume of data to monitor. At CERN, the current solution for spotting anomalous server machines in the computing infrastructure relies on threshold-based alarming systems that system managers carefully set on the time series of performance metrics of each infrastructure component. The goal of this work is to relieve the burden of this complex task and to explore fully automated machine learning solutions based on anomaly detection. Moreover, in most real industrial scenarios, labeled data for training supervised machine learning methods are unavailable because they are costly or difficult to collect. Therefore, our focus is on fully unsupervised anomaly detection methods, and we explore the current state of the art, including both traditional anomaly detection methods and recent successful deep anomaly detection approaches. In this work, we propose novel formulations of time-series-specific approaches (CNN Forecaster, VAR Forecaster) and adaptations that reuse traditional machine learning methods (LOF, OCSVM, IFOREST, KNN, PCA) and deep learning methods (Autoencoder Fully Connected, CNN Autoencoder, LSTM Autoencoder) with time series data. In addition, we explore six ensemble strategies to combine the strengths of the individual algorithms. We then present a comparative study of these 10 individual methods and 6 ensemble strategies on the CERN use case to identify the best approach for the specific problem characteristics of the CERN large-scale computing infrastructure. Furthermore, given the absence of labeled data, we put in place an annotation system that makes it possible to annotate time series efficiently. We use it to collect and create two new time series datasets for anomaly detection that represent two different CERN user categories.
The results of this study, in terms of ROC-AUC detection performance and training time, make a strong case for the traditional methods, which work extremely well for the specific problem at hand; on the other hand, we also observe that they tend to be outperformed by deep methods whenever the time series patterns of normal instances become more complex. In parallel with our comparative evaluation study, we also publish an open-source proof-of-concept anomaly detection system.
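
As a rough illustration of how a traditional method such as KNN might be reused on time series data and scored with ROC-AUC, the sketch below cuts a series into sliding windows, scores each window by the distance to its k-th nearest neighbour, and computes ROC-AUC against window labels. It is not the thesis implementation: the function names, the window width, the value of k, and the synthetic signal with an injected spike are all assumptions made for the example.

```python
import math


def sliding_windows(series, width):
    """Cut a 1-D series into overlapping windows of `width` points."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def knn_scores(windows, k=3):
    """Score each window by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, window in enumerate(windows):
        dists = sorted(
            math.dist(window, other)
            for j, other in enumerate(windows)
            if j != i
        )
        scores.append(dists[k - 1])
    return scores


def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: a periodic "performance metric" with one injected spike.
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]
spike_at = 120
series[spike_at] += 5.0

width = 10  # assumed window width, not taken from the thesis
windows = sliding_windows(series, width)
scores = knn_scores(windows, k=3)

# A window is labelled anomalous if it contains the spike.
labels = [1 if i <= spike_at < i + width else 0 for i in range(len(windows))]
auc = roc_auc(scores, labels)
print(f"windows={len(windows)} roc_auc={auc:.3f}")
```

On a clean periodic signal like this one, every normal window has near-identical copies one period away, so its score stays close to zero while windows containing the spike stand out. Real monitoring data would be noisier, and the quadratic pairwise distance computation here would need an indexed neighbour search at data-center scale.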