Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs

In the Worldwide LHC Computing Grid (WLCG) project the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources, a non-availabil...

Bibliographic Details
Main Authors:	Dimou, M, Dres, H, Dulov, O, Grein, G
Language:	eng
Published:	2013
Online Access:	http://cds.cern.ch/record/1622238

_version_	1780933231769550848
author	Dimou, M Dres, H Dulov, O Grein, G
author_facet	Dimou, M Dres, H Dulov, O Grein, G
author_sort	Dimou, M
collection	CERN
description	In the Worldwide LHC Computing Grid (WLCG) project the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources, non-availability of resources at the Tier0 or the Tier1s can seriously harm not only WLCG Operations but also the experiments' workflow and the storage of LHC data, which are very expensive to reproduce. This is why the availability requirements for these sites are high and are committed to in the WLCG Memorandum of Understanding (MoU). In this talk we describe the workflow of GGUS ALARMs, the only 24/7 mechanism available to LHC experiment experts for reporting problems with their Critical Services to the Tier0 or the Tier1s. We explain the conclusions and experience gained from the detailed drills performed for each such ALARM over the last 4 years, as well as the shift over time in the types of problems encountered. The physical infrastructure put in place to achieve GGUS 24/7 availability is also summarised.
id	cern-1622238
institution	Organización Europea para la Investigación Nuclear
language	eng
publishDate	2013
record_format	invenio
spelling	cern-1622238 2019-09-30T06:29:59Z http://cds.cern.ch/record/1622238 eng Dimou, M Dres, H Dulov, O Grein, G Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs In the Worldwide LHC Computing Grid (WLCG) project the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources, non-availability of resources at the Tier0 or the Tier1s can seriously harm not only WLCG Operations but also the experiments' workflow and the storage of LHC data, which are very expensive to reproduce. This is why the availability requirements for these sites are high and are committed to in the WLCG Memorandum of Understanding (MoU). In this talk we describe the workflow of GGUS ALARMs, the only 24/7 mechanism available to LHC experiment experts for reporting problems with their Critical Services to the Tier0 or the Tier1s. We explain the conclusions and experience gained from the detailed drills performed for each such ALARM over the last 4 years, as well as the shift over time in the types of problems encountered. The physical infrastructure put in place to achieve GGUS 24/7 availability is also summarised. Poster-2013-391 oai:cds.cern.ch:1622238 2013-10-07
spellingShingle	Dimou, M Dres, H Dulov, O Grein, G Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title	Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_full	Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_fullStr	Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_full_unstemmed	Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_short	Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_sort	handling worldwide lhc computing grid critical service incidents : the infrastructure and experience behind nearly 5 years of ggus alarms
url	http://cds.cern.ch/record/1622238
work_keys_str_mv	AT dimoum handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms AT dresh handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms AT dulovo handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms AT greing handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms
