
LHCb: Phronesis, a diagnosis and recovery tool for system administrators

The backbone of the LHCb experiment is the Online system, a very large and heterogeneous computing center. Ensuring the proper behavior of the many different tasks running on its more than 2000 servers is a huge, round-the-clock workload for the small expert-operator team. A...


Bibliographic Details
Main authors: Haen, C, Barra, V, Bonaccorsi, E, Neufeld, N
Language: eng
Published: 2013
Online access: http://cds.cern.ch/record/1610849
_version_ 1780932040719335424
author Haen, C
Barra, V
Bonaccorsi, E
Neufeld, N
collection CERN
description The backbone of the LHCb experiment is the Online system, a very large and heterogeneous computing center. Ensuring the proper behavior of the many different tasks running on its more than 2000 servers is a huge, round-the-clock workload for the small expert-operator team. At CHEP 2012, we presented a prototype of a framework designed to support these experts. Its main objective is to provide them with continuously improving diagnosis and recovery solutions when a service misbehaves, without having to modify the original applications. The framework is based on adapted principles of the Autonomic Computing model, on reinforcement learning algorithms, and on innovative concepts such as Shared Experience. While the CHEP 2012 presentation showed the validity of the prototype on simulations, here we present a version with improved algorithms and manipulation tools, and report on our experience running it in the LHCb Online system.
id cern-1610849
institution European Organization for Nuclear Research
language eng
publishDate 2013
record_format invenio
spelling cern-1610849 | 2019-09-30T06:29:59Z | http://cds.cern.ch/record/1610849 | eng | Haen, C | Barra, V | Bonaccorsi, E | Neufeld, N | LHCb: Phronesis, a diagnosis and recovery tool for system administrators | Poster-2013-331 | oai:cds.cern.ch:1610849 | 2013-10-14
title LHCb: Phronesis, a diagnosis and recovery tool for system administrators
url http://cds.cern.ch/record/1610849