Monitoring techniques and alarm procedures for CMS services and sites in WLCG

The CMS offline computing system is composed of roughly 80 sites (including most experienced T3s) and a number of central services to distribute, process and analyze data worldwide. A high level of stability and reliability is required from the underlying infrastructure and services, partially cove...

Bibliographic Details
Main author:	Molina-Perez, Jorge Amando
Language:	eng
Published:	2012
Subjects:	Detectors and Experimental Techniques
Online access:	http://cds.cern.ch/record/1457782

_version_	1780925133562576896
author	Molina-Perez, Jorge Amando
author_facet	Molina-Perez, Jorge Amando
author_sort	Molina-Perez, Jorge Amando
collection	CERN
description	The CMS offline computing system is composed of roughly 80 sites (including most experienced T3s) and a number of central services to distribute, process and analyze data worldwide. A high level of stability and reliability is required from the underlying infrastructure and services, partially covered by local or automated monitoring and alarming systems such as Lemon and SLS: the former collects metrics from sensors installed on computing nodes and triggers alarms when values are out of range; the latter measures the quality of service and warns managers when service is affected. CMS has established computing shift procedures with personnel operating worldwide from remote Computing Centers, under the supervision of the Computing Run Coordinator on duty at CERN. These dedicated 24/7 computing shift personnel help to detect and react promptly to any unexpected error, and hence ensure that CMS workflows are carried out efficiently and in a sustained manner. Synergy among all the involved actors is exploited to ensure the 24/7 monitoring, alarming and troubleshooting of the CMS computing sites and services. We review the deployment of the monitoring and alarming procedures, and report on the experience gained throughout the first 2 years of LHC operation. We describe the efficiency of the communication tools employed, the coherent monitoring framework, the proactive alarming systems and the proficient troubleshooting procedures that helped the CMS Computing facilities and infrastructure to operate at high reliability levels.
id	cern-1457782
institution	Organización Europea para la Investigación Nuclear
language	eng
publishDate	2012
record_format	invenio
spelling	cern-1457782 | 2019-09-30T06:29:59Z | http://cds.cern.ch/record/1457782 | eng | Molina-Perez, Jorge Amando | Monitoring techniques and alarm procedures for CMS services and sites in WLCG | Detectors and Experimental Techniques | CMS-CR-2012-100 | oai:cds.cern.ch:1457782 | 2012-05-15
spellingShingle	Detectors and Experimental Techniques Molina-Perez, Jorge Amando Monitoring techniques and alarm procedures for CMS services and sites in WLCG
title	Monitoring techniques and alarm procedures for CMS services and sites in WLCG
title_full	Monitoring techniques and alarm procedures for CMS services and sites in WLCG
title_fullStr	Monitoring techniques and alarm procedures for CMS services and sites in WLCG
title_full_unstemmed	Monitoring techniques and alarm procedures for CMS services and sites in WLCG
title_short	Monitoring techniques and alarm procedures for CMS services and sites in WLCG
title_sort	monitoring techniques and alarm procedures for cms services and sites in wlcg
topic	Detectors and Experimental Techniques
url	http://cds.cern.ch/record/1457782
work_keys_str_mv	AT molinaperezjorgeamando monitoringtechniquesandalarmproceduresforcmsservicesandsitesinwlcg
