End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure

The CERN Openstack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. In the Infrastructure, about 14,000 Virtual Machines are deployed, covering several use cases, from web front-ends to databases and analytics platforms. Cloud servic...

Bibliographic Details
Main author:	Metaj, Stiven
Language:	eng
Published:	2022
Subjects:	Computing and Computers
Online access:	http://cds.cern.ch/record/2810448

_version_	1780973220339384320
author	Metaj, Stiven
author_facet	Metaj, Stiven
author_sort	Metaj, Stiven
collection	CERN
description	The CERN Openstack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. About 14,000 Virtual Machines are deployed in the Infrastructure, covering several use cases, from web front-ends to databases and analytics platforms. Cloud service managers have to make sure that the desired computational power is delivered to all users, and to accomplish this task it is crucial to spot anomalous server machines in time. The previously adopted solution consists of monitoring the performance metrics of the machines with a threshold-based alarming system. In this thesis, we present the new Anomaly Detection (AD) system that currently runs in the CERN Cloud Infrastructure. On these multivariate time-series metrics, we run three different unsupervised Machine Learning models: Isolation Forest, LSTM-AutoEncoder and GRU-AutoEncoder. Then, using an ensemble approach, we automatically report to the CERN Cloud managers each day the most anomalous servers of the previous day. We show the end-to-end pipeline going from the data sources to the detected anomalies, the details of the system architecture, the pre-processing steps implemented, and the design choices behind our solution. Furthermore, we present a new labelled evaluation dataset for the CERN Cloud case study, and the results on this dataset of our experiments comparing the three models used in the system. In particular, we show that the three adopted models, despite their very different nature, all achieve high performance (AUC-ROC > 0.95), and that they all outperform the previous threshold-based system in terms of true positive rate at the false positive rate required by the Data Center's operators. In addition, we compare the time performance of the models, and we show that the training is robust to the selection and size of the training data.
id	cern-2810448
institution	Organización Europea para la Investigación Nuclear
language	eng
publishDate	2022
record_format	invenio
spelling	cern-2810448 | 2022-06-03T21:12:29Z | http://cds.cern.ch/record/2810448 | eng | Metaj, Stiven | End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure | Computing and Computers | CERN-THESIS-2022-049 | oai:cds.cern.ch:2810448 | 2022-05-24T19:33:58Z
spellingShingle	Computing and Computers Metaj, Stiven End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title	End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_full	End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_fullStr	End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_full_unstemmed	End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_short	End-to-end anomaly detection system in the CERN Openstack Cloud infrastructure
title_sort	end-to-end anomaly detection system in the cern openstack cloud infrastructure
topic	Computing and Computers
url	http://cds.cern.ch/record/2810448
work_keys_str_mv	AT metajstiven endtoendanomalydetectionsysteminthecernopenstackcloudinfrastructure
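
The description in this record mentions an ensemble of three anomaly detectors and an evaluation based on AUC-ROC and on the true positive rate at a fixed false positive rate. The record does not say how the ensemble combines the model scores, so the sketch below assumes simple rank averaging, purely for illustration. All function names, scores and labels are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Sequence


def ranks(scores: Sequence[float]) -> List[float]:
    """Return 1-based ranks (1 = lowest score); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def ensemble_scores(per_model: Dict[str, Sequence[float]]) -> List[float]:
    """Combine models by averaging per-model ranks (an assumed combination rule)."""
    n = len(next(iter(per_model.values())))
    total = [0.0] * n
    for scores in per_model.values():
        for idx, r in enumerate(ranks(scores)):
            total[idx] += r
    return [t / len(per_model) for t in total]


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC through the Mann-Whitney U statistic (labels: 1 = anomalous)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks(scores), labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float) -> float:
    """Highest true positive rate over thresholds whose false positive rate <= max_fpr."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


# Made-up anomaly scores for six servers (higher = more anomalous).
labels = [0, 0, 0, 0, 1, 1]
per_model = {
    "isolation_forest": [0.1, 0.2, 0.3, 0.8, 0.7, 0.9],
    "lstm_autoencoder": [0.2, 0.1, 0.4, 0.3, 0.9, 0.8],
    "gru_autoencoder": [0.1, 0.3, 0.2, 0.4, 0.8, 0.9],
}
combined = ensemble_scores(per_model)

# Servers sorted from most to least anomalous according to the ensemble.
most_anomalous = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

print("ensemble AUC-ROC:", auc_roc(combined, labels))
print("ensemble TPR at FPR <= 0:", tpr_at_fpr(combined, labels, 0.0))
print("top-2 servers:", most_anomalous[:2])
```

Rank averaging is used here only because it makes scores from different model types comparable without calibration; a real deployment might normalise or weight the scores differently.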
