
End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure

The CERN Openstack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. In the Infrastructure, about 14,000 Virtual Machines are deployed, covering several use cases, from web front-ends to databases and analytics platforms. Cloud servic...


Bibliographic Details
Main author: Metaj, Stiven
Language: eng
Published: 2022
Subjects: Computing and Computers
Online access: http://cds.cern.ch/record/2810448
_version_ 1780973220339384320
author Metaj, Stiven
author_facet Metaj, Stiven
author_sort Metaj, Stiven
collection CERN
description The CERN Openstack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. In the Infrastructure, about 14,000 Virtual Machines are deployed, covering several use cases, from web front-ends to databases and analytics platforms. Cloud service managers have to make sure that the desired computational power is delivered to all the users, and to accomplish this task, spotting anomalous server machines in time is crucial. The previously adopted solution consists of monitoring the performance metrics of the machines using a threshold-based alarming system. In this thesis, we present the new Anomaly Detection (AD) system that currently runs in the CERN Cloud Infrastructure. Given these multivariate time series metrics, we run three different unsupervised Machine Learning models: Isolation Forest, LSTM-AutoEncoder and GRU-AutoEncoder. Then, using an ensemble approach, we automatically propose to the CERN Cloud managers, on a daily basis, the most anomalous servers of the previous day. We show the related end-to-end pipeline going from the data sources to the detected anomalies, the details of the architecture of the system, the pre-processing steps implemented, and the design choices regarding our solution. Furthermore, we present a new labelled evaluation dataset related to the CERN Cloud case study, and the results, with respect to this dataset, of our experiments comparing the three models we use in the system. In particular, we show that the three adopted models, despite their very different nature, all achieve high performance (AUC-ROC > 0.95), and that they all outperform the previous threshold-based system in terms of true positive rate at the false positive rate required by the Data Center's operators. In addition, we compare the time performance of the models, and we show that the training is robust to the selection and size of the training data.
id cern-2810448
institution European Organization for Nuclear Research
language eng
publishDate 2022
record_format invenio
spelling cern-2810448 2022-06-03T21:12:29Z http://cds.cern.ch/record/2810448 eng Metaj, Stiven End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure Computing and Computers The CERN Openstack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. In the Infrastructure, about 14,000 Virtual Machines are deployed, covering several use cases, from web front-ends to databases and analytics platforms. Cloud service managers have to make sure that the desired computational power is delivered to all the users, and to accomplish this task, spotting anomalous server machines in time is crucial. The previously adopted solution consists of monitoring the performance metrics of the machines using a threshold-based alarming system. In this thesis, we present the new Anomaly Detection (AD) system that currently runs in the CERN Cloud Infrastructure. Given these multivariate time series metrics, we run three different unsupervised Machine Learning models: Isolation Forest, LSTM-AutoEncoder and GRU-AutoEncoder. Then, using an ensemble approach, we automatically propose to the CERN Cloud managers, on a daily basis, the most anomalous servers of the previous day. We show the related end-to-end pipeline going from the data sources to the detected anomalies, the details of the architecture of the system, the pre-processing steps implemented, and the design choices regarding our solution. Furthermore, we present a new labelled evaluation dataset related to the CERN Cloud case study, and the results, with respect to this dataset, of our experiments comparing the three models we use in the system. In particular, we show that the three adopted models, despite their very different nature, all achieve high performance (AUC-ROC > 0.95), and that they all outperform the previous threshold-based system in terms of true positive rate at the false positive rate required by the Data Center's operators. In addition, we compare the time performance of the models, and we show that the training is robust to the selection and size of the training data. CERN-THESIS-2022-049 oai:cds.cern.ch:2810448 2022-05-24T19:33:58Z
spellingShingle Computing and Computers
Metaj, Stiven
End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_full End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_fullStr End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_full_unstemmed End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_short End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_sort end-to-end anomaly detection system in the cern openstack cloud infrastructure
topic Computing and Computers
url http://cds.cern.ch/record/2810448
work_keys_str_mv AT metajstiven endtoendanomalydetectionsysteminthecernopenstackcloudinfrastructure