_version_ 1780925322705764352
author Bauer, Gerry
Behrens, Ulf
Bouffet, Olivier
Bowen, Matthew
Branson, James
Bukowiec, Sebastian Czeslaw
Ciganek, Marek
Cittolin, Sergio
Coarasa Perez, Jose Antonio
Deldicque, Christian
Dobson, Marc
Dupont, Aymeric
Erhan, Samim
Flossdorf, Alexander
Gigi, Dominique
Glege, Frank
Gomez-Reino Garrido, Robert
Hartl, Christian
Hegeman, Jeroen Guido
Holzner, Andre Georg
Hwong, Yi Ling
Masetti, Lorenzo
Meijers, Franciscus
Meschi, Emilio
Mommsen, Remigius
O'Dell, Vivian
Orsini, Luciano
Paus, Christoph Maria Ernst
Petrucci, Andrea
Pieri, Marco
Polese, Giovanni
Racz, Attila
Raginel, Olivier
Sakulin, Hannes
Sani, Matteo
Schwick, Christoph
Shpakov, Denis
Simon, Michal
Spataru, Andrei Cristian
Sumorok, Konstanty
author_facet Bauer, Gerry
Behrens, Ulf
Bouffet, Olivier
Bowen, Matthew
Branson, James
Bukowiec, Sebastian Czeslaw
Ciganek, Marek
Cittolin, Sergio
Coarasa Perez, Jose Antonio
Deldicque, Christian
Dobson, Marc
Dupont, Aymeric
Erhan, Samim
Flossdorf, Alexander
Gigi, Dominique
Glege, Frank
Gomez-Reino Garrido, Robert
Hartl, Christian
Hegeman, Jeroen Guido
Holzner, Andre Georg
Hwong, Yi Ling
Masetti, Lorenzo
Meijers, Franciscus
Meschi, Emilio
Mommsen, Remigius
O'Dell, Vivian
Orsini, Luciano
Paus, Christoph Maria Ernst
Petrucci, Andrea
Pieri, Marco
Polese, Giovanni
Racz, Attila
Raginel, Olivier
Sakulin, Hannes
Sani, Matteo
Schwick, Christoph
Shpakov, Denis
Simon, Michal
Spataru, Andrei Cristian
Sumorok, Konstanty
author_sort Bauer, Gerry
collection CERN
description The CMS experiment's online cluster consists of 2300 computers and 170 switches or routers operating around the clock. This large infrastructure must be monitored so that administrators are proactively warned of any failure or degradation, in order to avoid or minimize downtime that can lead to loss of data taking. Between 20 and 40 metrics are monitored per host, ranging from basic host checks (disk, network, load) to application-specific checks (service running), in addition to hardware monitoring. The sheer number of hosts and checks per host stretches the limits of many monitoring tools and requires careful use of various configuration optimizations to work reliably. The initial monitoring system used in the CMS online cluster was based on Nagios, but it suffered from various drawbacks and did not work reliably in the expanded cluster. The CMS cluster administrators investigated the available open-source tools and chose Icinga, a fork of Nagios, with several plugin modules to enhance its scalability. The Gearman module provides a queuing system for all checks and their results, allowing easy load balancing across worker nodes. Supported modules allow checks to be grouped into a single request, significantly reducing the network overhead of running a set of checks on a group of nodes. The PNP4Nagios module gives Icinga its graphing capability, using files as round-robin databases (RRD). Additional software (rrdcached) optimizes access to the RRD files and is vital for supporting the required number of operations. 
Furthermore, to make the best use of the monitoring information in notifying the appropriate communities of any issues with their systems, much work went into grouping the checks according to, for example, the function of the machine, the services running, the sub-detector to which it belongs, and the criticality of the computer. An automated system generates the configuration of the monitoring system to facilitate its evolution and maintenance. The use of these performance-enhancing modules and the work on grouping the checks have yielded substantial performance improvements over the previous Nagios infrastructure, allowing many more metrics to be monitored per second. In addition, the design allows the infrastructure to grow easily without the need to rethink the monitoring system as a whole.
id cern-1462972
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2012
record_format invenio
spelling cern-1462972 2019-09-30T06:29:59Z http://cds.cern.ch/record/1462972 eng Health And Performance Monitoring Of The Online Computer Cluster Of CMS Detectors and Experimental Techniques CMS-CR-2012-158 oai:cds.cern.ch:1462972 2012-06-20
spellingShingle Detectors and Experimental Techniques
Bauer, Gerry
Behrens, Ulf
Bouffet, Olivier
Bowen, Matthew
Branson, James
Bukowiec, Sebastian Czeslaw
Ciganek, Marek
Cittolin, Sergio
Coarasa Perez, Jose Antonio
Deldicque, Christian
Dobson, Marc
Dupont, Aymeric
Erhan, Samim
Flossdorf, Alexander
Gigi, Dominique
Glege, Frank
Gomez-Reino Garrido, Robert
Hartl, Christian
Hegeman, Jeroen Guido
Holzner, Andre Georg
Hwong, Yi Ling
Masetti, Lorenzo
Meijers, Franciscus
Meschi, Emilio
Mommsen, Remigius
O'Dell, Vivian
Orsini, Luciano
Paus, Christoph Maria Ernst
Petrucci, Andrea
Pieri, Marco
Polese, Giovanni
Racz, Attila
Raginel, Olivier
Sakulin, Hannes
Sani, Matteo
Schwick, Christoph
Shpakov, Denis
Simon, Michal
Spataru, Andrei Cristian
Sumorok, Konstanty
Health And Performance Monitoring Of The Online Computer Cluster Of CMS
title Health And Performance Monitoring Of The Online Computer Cluster Of CMS
title_full Health And Performance Monitoring Of The Online Computer Cluster Of CMS
title_fullStr Health And Performance Monitoring Of The Online Computer Cluster Of CMS
title_full_unstemmed Health And Performance Monitoring Of The Online Computer Cluster Of CMS
title_short Health And Performance Monitoring Of The Online Computer Cluster Of CMS
title_sort health and performance monitoring of the online computer cluster of cms
topic Detectors and Experimental Techniques
url http://cds.cern.ch/record/1462972
work_keys_str_mv AT bauergerry healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT behrensulf healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT bouffetolivier healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT bowenmatthew healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT bransonjames healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT bukowiecsebastianczeslaw healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT ciganekmarek healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT cittolinsergio healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT coarasaperezjoseantonio healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT deldicquechristian healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT dobsonmarc healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT dupontaymeric healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT erhansamim healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT flossdorfalexander healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT gigidominique healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT glegefrank healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT gomezreinogarridorobert healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT hartlchristian healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT hegemanjeroenguido healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT holznerandregeorg healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT hwongyiling healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT masettilorenzo healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT meijersfranciscus healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT meschiemilio healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT mommsenremigius healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT odellvivian healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT orsiniluciano healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT pauschristophmariaernst healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT petrucciandrea healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT pierimarco healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT polesegiovanni healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT raczattila healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT raginelolivier healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT sakulinhannes healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT sanimatteo healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT schwickchristoph healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT shpakovdenis healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT simonmichal healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT spataruandreicristian healthandperformancemonitoringoftheonlinecomputerclusterofcms
AT sumorokkonstanty healthandperformancemonitoringoftheonlinecomputerclusterofcms