End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
The CERN Openstack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. In the Infrastructure, about 14,000 Virtual Machines are deployed, covering several use cases, from web front-ends to databases and analytics platforms. Cloud servic...
Main author: | Metaj, Stiven |
---|---|
Language: | eng |
Published: | 2022 |
Subjects: | Computing and Computers |
Online access: | http://cds.cern.ch/record/2810448 |
_version_ | 1780973220339384320 |
---|---|
author | Metaj, Stiven |
author_facet | Metaj, Stiven |
author_sort | Metaj, Stiven |
collection | CERN |
description | The CERN Openstack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. In the Infrastructure, about 14,000 Virtual Machines are deployed, covering several use cases, from web front-ends to databases and analytics platforms. Cloud service managers have to make sure that the desired computational power is delivered to all users, and to accomplish this task, spotting anomalous server machines in time is crucial. The previously adopted solution consisted of monitoring the performance metrics of the machines using a threshold-based alarming system. In this thesis, we present the new Anomaly Detection (AD) system that currently runs in the CERN Cloud Infrastructure. Given these multivariate time series metrics, we run three different unsupervised Machine Learning models: Isolation Forest, LSTM-AutoEncoder and GRU-AutoEncoder. Then, using an ensemble approach, we automatically report to the CERN Cloud managers, each day, the most anomalous servers of the previous day. We show the related end-to-end pipeline going from the data sources to the detected anomalies, the details of the architecture of the system, the pre-processing steps implemented, and the design choices behind our solution. Furthermore, we present a new labelled evaluation dataset for the CERN Cloud case study, and the results on this dataset of our experiments comparing the three models we use in the system. In particular, we show that the three adopted models, despite their very different nature, all achieve high performance (AUC-ROC > 0.95), and that they all outperform the previous threshold-based system in terms of true positive rate at the false positive rate required by the Data Center's operators. In addition, we compare the time performance of the models, and we show that the training is robust to the selection and size of the training data. |
id | cern-2810448 |
institution | European Organization for Nuclear Research (CERN) |
language | eng |
publishDate | 2022 |
record_format | invenio |
spelling | cern-2810448 2022-06-03T21:12:29Z http://cds.cern.ch/record/2810448 eng Metaj, Stiven End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure Computing and Computers CERN-THESIS-2022-049 oai:cds.cern.ch:2810448 2022-05-24T19:33:58Z |
spellingShingle | Computing and Computers Metaj, Stiven End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure |
title | End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure |
title_full | End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure |
title_fullStr | End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure |
title_full_unstemmed | End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure |
title_short | End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure |
title_sort | end-to-end anomaly detection system in the cern openstack cloud infrastructure |
topic | Computing and Computers |
url | http://cds.cern.ch/record/2810448 |
work_keys_str_mv | AT metajstiven endtoendanomalydetectionsysteminthecernopenstackcloudinfrastructure |
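
The description compares the models by AUC-ROC and by the true positive rate reached at the false positive rate the operators require. As a rough illustration of how those two figures can be read off a set of anomaly scores, here is a minimal sketch in plain Python. It is not the thesis's evaluation code: the function names (`roc_points`, `auc_roc`, `tpr_at_fpr`) and the toy labels and scores are assumptions invented for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Point]:
    """Sweep a threshold over the scores and collect (FPR, TPR) points.

    labels: 1 for an anomalous server, 0 for a normal one (hypothetical data).
    scores: higher means "more anomalous".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one anomalous and one normal sample")

    # Sort by decreasing score so that each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all samples sharing the same score before emitting a point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auc_roc(points: Sequence[Point]) -> float:
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def tpr_at_fpr(points: Sequence[Point], max_fpr: float) -> float:
    """Highest TPR reachable without exceeding the allowed FPR."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    labels = [1, 1, 0, 1, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.2, 0.1]
    curve = roc_points(labels, scores)
    print("AUC-ROC:", round(auc_roc(curve), 4))
    print("TPR at FPR <= 0.25:", tpr_at_fpr(curve, 0.25))
```

Fixing the false positive rate first and then comparing true positive rates mirrors the operational constraint stated in the description: operators can only review a limited number of false alarms, so the useful question is how many real anomalies each model catches within that budget.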