Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs

In the Worldwide LHC Computing Grid (WLCG) project, the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources, a non-availabil...

Bibliographic Details
Main authors: Dimou, M, Dres, H, Dulov, O, Grein, G
Language: eng
Published: 2013
Online access: http://cds.cern.ch/record/1622238
_version_ 1780933231769550848
author Dimou, M
Dres, H
Dulov, O
Grein, G
author_facet Dimou, M
Dres, H
Dulov, O
Grein, G
author_sort Dimou, M
collection CERN
description In the Worldwide LHC Computing Grid (WLCG) project, the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources, a non-availability of resources at the Tier0 or the Tier1s can seriously harm not only WLCG Operations but also the experiments' workflow and the storage of LHC data, which are very expensive to reproduce. This is why availability requirements for these sites are high and are committed to in the WLCG Memorandum of Understanding (MoU). In this talk we describe the workflow of GGUS ALARMs, the only 24/7 mechanism available to LHC experiment experts for reporting problems with their Critical Services to the Tier0 or the Tier1s. We explain the conclusions and experience gained from the detailed drills performed for each such ALARM over the last 4 years, as well as the shift over time in the types of problems encountered. The physical infrastructure put in place to achieve GGUS 24/7 availability is summarised.
id cern-1622238
institution European Organization for Nuclear Research
language eng
publishDate 2013
record_format invenio
spelling cern-1622238
2019-09-30T06:29:59Z
http://cds.cern.ch/record/1622238
eng
Dimou, M
Dres, H
Dulov, O
Grein, G
Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
In the Worldwide LHC Computing Grid (WLCG) project, the Tier centres are of paramount importance for storing and accessing experiment data and for running the batch jobs necessary for experiment production activities. Although Tier2 sites provide a significant fraction of the resources, a non-availability of resources at the Tier0 or the Tier1s can seriously harm not only WLCG Operations but also the experiments' workflow and the storage of LHC data, which are very expensive to reproduce. This is why availability requirements for these sites are high and are committed to in the WLCG Memorandum of Understanding (MoU). In this talk we describe the workflow of GGUS ALARMs, the only 24/7 mechanism available to LHC experiment experts for reporting problems with their Critical Services to the Tier0 or the Tier1s. We explain the conclusions and experience gained from the detailed drills performed for each such ALARM over the last 4 years, as well as the shift over time in the types of problems encountered. The physical infrastructure put in place to achieve GGUS 24/7 availability is summarised.
Poster-2013-391
oai:cds.cern.ch:1622238
2013-10-07
spellingShingle Dimou, M
Dres, H
Dulov, O
Grein, G
Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_full Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_fullStr Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_full_unstemmed Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_short Handling Worldwide LHC Computing Grid Critical Service Incidents : The infrastructure and experience behind nearly 5 years of GGUS ALARMs
title_sort handling worldwide lhc computing grid critical service incidents : the infrastructure and experience behind nearly 5 years of ggus alarms
url http://cds.cern.ch/record/1622238
work_keys_str_mv AT dimoum handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms
AT dresh handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms
AT dulovo handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms
AT greing handlingworldwidelhccomputinggridcriticalserviceincidentstheinfrastructureandexperiencebehindnearly5yearsofggusalarms