Phronesis, a diagnosis and recovery tool for system administrators

The LHCb experiment relies on the Online system, which includes a very large and heterogeneous computing cluster. Ensuring the proper behavior of the different tasks running on the more than 2000 servers represents a huge workload for the small operator team and is a 24/7 task. At CHEP 2012, we pres...

Bibliographic Details
Main authors: Haen, C, Barra, V, Bonaccorsi, E, Neufeld, N
Published: 2014
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.1088/1742-6596/513/6/062021
http://cds.cern.ch/record/2055729
_version_ 1780948314212007936
author Haen, C
Barra, V
Bonaccorsi, E
Neufeld, N
author_facet Haen, C
Barra, V
Bonaccorsi, E
Neufeld, N
author_sort Haen, C
collection CERN
description The LHCb experiment relies on the Online system, which includes a very large and heterogeneous computing cluster. Ensuring the proper behavior of the different tasks running on the more than 2000 servers represents a huge workload for the small operator team and is a 24/7 task. At CHEP 2012, we presented a prototype of a framework that we designed in order to support the experts. The main objective is to provide them with steadily improving diagnosis and recovery solutions in case of misbehavior of a service, without having to modify the original applications. Our framework is based on adapted principles of the Autonomic Computing model, on Reinforcement Learning algorithms, as well as innovative concepts such as Shared Experience. While the submission at CHEP 2012 showed the validity of our prototype on simulations, we here present an implementation with improved algorithms and manipulation tools, and report on the experience gained with running it in the LHCb Online system.
id cern-2055729
institution European Organization for Nuclear Research
publishDate 2014
record_format invenio
spelling cern-2055729; 2022-08-17T13:25:19Z; doi:10.1088/1742-6596/513/6/062021; http://cds.cern.ch/record/2055729; Haen, C; Barra, V; Bonaccorsi, E; Neufeld, N; Phronesis, a diagnosis and recovery tool for system administrators; Computing and Computers; oai:cds.cern.ch:2055729; 2014
spellingShingle Computing and Computers
Haen, C
Barra, V
Bonaccorsi, E
Neufeld, N
Phronesis, a diagnosis and recovery tool for system administrators
title Phronesis, a diagnosis and recovery tool for system administrators
title_full Phronesis, a diagnosis and recovery tool for system administrators
title_fullStr Phronesis, a diagnosis and recovery tool for system administrators
title_full_unstemmed Phronesis, a diagnosis and recovery tool for system administrators
title_short Phronesis, a diagnosis and recovery tool for system administrators
title_sort phronesis, a diagnosis and recovery tool for system administrators
topic Computing and Computers
url https://dx.doi.org/10.1088/1742-6596/513/6/062021
http://cds.cern.ch/record/2055729
work_keys_str_mv AT haenc phronesisadiagnosisandrecoverytoolforsystemadministrators
AT barrav phronesisadiagnosisandrecoverytoolforsystemadministrators
AT bonaccorsie phronesisadiagnosisandrecoverytoolforsystemadministrators
AT neufeldn phronesisadiagnosisandrecoverytoolforsystemadministrators