
Integrated monitoring of the ATLAS online computing farm

The online farm of the ATLAS experiment at the LHC, consisting of nearly 4000 PCs with various characteristics, provides configuration and control of the detector and performs the collection, processing, selection and conveyance of event data from the front-end electronics to mass storage. The statu...


Bibliographic Details
Main authors: Ballestrero, Sergio; Brasolin, Franco; Fazio, Daniel; Gament, Costin-Eugen; Lee, Christopher; Scannicchio, Diana; Twomey, Matthew Shaun
Language: eng
Published: 2016
Subjects: Particle Physics - Experiment
Online access: http://cds.cern.ch/record/2221693
author Ballestrero, Sergio
Brasolin, Franco
Fazio, Daniel
Gament, Costin-Eugen
Lee, Christopher
Scannicchio, Diana
Twomey, Matthew Shaun
collection CERN
description The online farm of the ATLAS experiment at the LHC, consisting of nearly 4000 PCs with various characteristics, provides configuration and control of the detector and performs the collection, processing, selection and conveyance of event data from the front-end electronics to mass storage. The status and health of every host must be constantly monitored to ensure the correct and reliable operation of the whole online system. This is the first line of defense, which should not only promptly provide alerts in case of failure but, whenever possible, warn of impending issues. The monitoring system should be able to check up to 100000 health parameters and provide alerts on a selected subset. In this paper we present the implementation and validation of our new monitoring and alerting system based on Icinga 2 and Ganglia. We describe how the load distribution and high availability features of Icinga 2 allowed us to have a centralised but scalable system, with a configuration model that allows full flexibility while still guaranteeing complete farm coverage. Finally, we cover the integration of Icinga 2 with Ganglia and other data sources, such as SNMP for system information and IPMI for hardware health.
id cern-2221693
institution European Organization for Nuclear Research
language eng
publishDate 2016
record_format invenio
spelling cern-2221693 | 2019-09-30T06:29:59Z | http://cds.cern.ch/record/2221693 | eng | Ballestrero, Sergio; Brasolin, Franco; Fazio, Daniel; Gament, Costin-Eugen; Lee, Christopher; Scannicchio, Diana; Twomey, Matthew Shaun | Integrated monitoring of the ATLAS online computing farm | Particle Physics - Experiment | ATL-DAQ-SLIDE-2016-765 | oai:cds.cern.ch:2221693 | 2016-10-04
title Integrated monitoring of the ATLAS online computing farm
topic Particle Physics - Experiment
url http://cds.cern.ch/record/2221693