Health And Performance Monitoring Of The Online Computer Cluster Of CMS
The CMS experiment's online cluster consists of 2300 computers and 170 switches or routers operating on a 24-hour basis. This huge infrastructure must be monitored in a way that the administrators are pro-actively warned of any failures or degradation in the system, in order to avoid or minimiz...
Main authors: | Bauer, Gerry, et al. |
---|---|
Language: | eng |
Published: | 2012 |
Subjects: | Detectors and Experimental Techniques |
Online access: | http://cds.cern.ch/record/1462972 |
_version_ | 1780925322705764352 |
---|---|
author | Bauer, Gerry; Behrens, Ulf; Bouffet, Olivier; Bowen, Matthew; Branson, James; Bukowiec, Sebastian Czeslaw; Ciganek, Marek; Cittolin, Sergio; Coarasa Perez, Jose Antonio; Deldicque, Christian; Dobson, Marc; Dupont, Aymeric; Erhan, Samim; Flossdorf, Alexander; Gigi, Dominique; Glege, Frank; Gomez-Reino Garrido, Robert; Hartl, Christian; Hegeman, Jeroen Guido; Holzner, Andre Georg; Hwong, Yi Ling; Masetti, Lorenzo; Meijers, Franciscus; Meschi, Emilio; Mommsen, Remigius; O'Dell, Vivian; Orsini, Luciano; Paus, Christoph Maria Ernst; Petrucci, Andrea; Pieri, Marco; Polese, Giovanni; Racz, Attila; Raginel, Olivier; Sakulin, Hannes; Sani, Matteo; Schwick, Christoph; Shpakov, Denis; Simon, Michal; Spataru, Andrei Cristian; Sumorok, Konstanty |
author_sort | Bauer, Gerry |
collection | CERN |
description | The CMS experiment's online cluster consists of 2300 computers and 170 switches or routers
operating around the clock. This large infrastructure must be monitored so that the administrators
are proactively warned of any failure or degradation in the system, in order to avoid or minimize
downtime, which can lead to loss of data taking. The number of metrics monitored per host
varies from 20 to 40 and ranges from basic host checks (disk, network, load) to application-specific checks
(service running), in addition to hardware monitoring. The sheer number of hosts and checks per host
stretches the limits of many monitoring tools and requires careful use of various
configuration optimizations to work reliably. The initial monitoring system used in the CMS online
cluster was based on Nagios, but it suffered from various drawbacks and did not work reliably in the
expanded cluster. The CMS cluster administrators investigated the available open source tools
and chose a fork of Nagios called Icinga, with several plugin modules to enhance its scalability.
The Gearman module provides a queuing system for all checks and their results, allowing easy load
balancing across worker nodes. Supported modules allow checks to be grouped into a single request,
significantly reducing the network overhead of running a set of checks on a group of nodes. The
PNP4nagios module adds graphing capability to Icinga and stores its data in round robin database
(RRD) files. Additional software (rrdcached) optimizes access to the RRD files and is vital to support
the required number of operations. Furthermore, to make the best use of the monitoring information and notify
the appropriate communities of any issues with their systems, much work was put into grouping the
checks according to, for example, the function of the machine, the services running, the sub-detectors to
which they belong, and the criticality of the computer. An automated system that generates the configuration
of the monitoring system has been produced to facilitate its evolution and maintenance. The use of these
performance-enhancing modules and the work on grouping the checks have yielded substantial
performance improvements over the previous Nagios infrastructure, allowing many
more metrics to be monitored per second. Furthermore, the design allows the infrastructure to grow
easily without the need to rethink the monitoring system as a whole. |
id | cern-1462972 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2012 |
record_format | invenio |
spelling | cern-1462972; 2019-09-30T06:29:59Z; http://cds.cern.ch/record/1462972; eng; Health And Performance Monitoring Of The Online Computer Cluster Of CMS; Detectors and Experimental Techniques; CMS-CR-2012-158; oai:cds.cern.ch:1462972; 2012-06-20 |
title | Health And Performance Monitoring Of The Online Computer Cluster Of CMS |
topic | Detectors and Experimental Techniques |
url | http://cds.cern.ch/record/1462972 |
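
The description mentions an automated system that generates the monitoring configuration and groups checks by machine function, sub-detector and criticality. The record gives no details of that system, so the following is only a minimal sketch of the general idea, not the authors' implementation. It renders Icinga/Nagios-style `define host` and `define hostgroup` objects from a small inventory. The inventory entries, the group naming scheme and the `generic-host` template name are all hypothetical.

```python
# Sketch: generate Icinga/Nagios-style object definitions from an inventory.
# All host names, addresses and group labels below are invented examples.
INVENTORY = [
    {"name": "node-001", "address": "10.0.0.1",
     "function": "builder", "subdetector": "tracker", "criticality": "high"},
    {"name": "node-002", "address": "10.0.0.2",
     "function": "filter", "subdetector": "tracker", "criticality": "low"},
]

# Grouping axes named in the description (function, sub-detector, criticality).
AXES = ("function", "subdetector", "criticality")


def hostgroups_for(host):
    """Return one hostgroup name per grouping axis, e.g. 'function-builder'."""
    return [f"{axis}-{host[axis]}" for axis in AXES]


def render_host(host):
    """Render a single 'define host' object for one inventory entry."""
    groups = ",".join(hostgroups_for(host))
    return (
        "define host {\n"
        "    use         generic-host\n"  # assumed pre-existing template
        f"    host_name   {host['name']}\n"
        f"    address     {host['address']}\n"
        f"    hostgroups  {groups}\n"
        "}\n"
    )


def render_hostgroups(inventory):
    """Render one 'define hostgroup' object per distinct group, sorted by name."""
    names = sorted({g for host in inventory for g in hostgroups_for(host)})
    return "".join(
        "define hostgroup {\n"
        f"    hostgroup_name  {name}\n"
        f"    alias           {name}\n"
        "}\n"
        for name in names
    )


def render_config(inventory):
    """Concatenate hostgroup definitions followed by host definitions."""
    return render_hostgroups(inventory) + "".join(render_host(h) for h in inventory)


if __name__ == "__main__":
    print(render_config(INVENTORY))
```

Because the groups are derived from inventory attributes, adding a machine only requires a new inventory entry. Notifications could then be routed per hostgroup, which is one plausible way to reach the "appropriate communities" the description refers to.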