Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud

Frequent validation and stress testing of the network, storage and CPU resources of a grid site is essential to achieve high performance and reliability. HammerCloud was previously introduced with the goals of enabling VO- and site-administrators to run such tests in an automated or on-demand manner...

Bibliographic Details
Main authors: Van der Ster, D, Elmsheuser, J, Medrano Llamas, R, Legger, F, Sciaba, A, Sciacca, G, Ubeda García, M
Language: eng
Published: 2012
Subjects: Computing and Computers
Online access: http://cds.cern.ch/record/1457967
_version_ 1780925147093401600
author Van der Ster, D
Elmsheuser, J
Medrano Llamas, R
Legger, F
Sciaba, A
Sciacca, G
Ubeda García, M
author_facet Van der Ster, D
Elmsheuser, J
Medrano Llamas, R
Legger, F
Sciaba, A
Sciacca, G
Ubeda García, M
author_sort Van der Ster, D
collection CERN
description Frequent validation and stress testing of the network, storage and CPU resources of a grid site is essential to achieve high performance and reliability. HammerCloud was previously introduced with the goals of enabling VO- and site-administrators to run such tests in an automated or on-demand manner. The ATLAS, CMS and LHCb experiments have all developed VO plugins for the service and have successfully integrated it into their grid operations infrastructures. This work will present the experience in running HammerCloud at full scale for more than 3 years and present solutions to the scalability issues faced by the service. First, we will show the particular challenges faced when integrating with CMS and LHCb offline computing, including customized dashboards to show site validation reports for the VOs and a new API to tightly integrate with the LHCbDIRAC Resource Status System. Next, a study of the automatic site exclusion component used by ATLAS will be presented along with results for tuning the exclusion policies. A study of the historical test results for ATLAS, CMS and LHCb will be presented, including comparisons between the experiments' grid availabilities and a search for site-based or temporal failure correlations. Finally, we will look to future plans that will allow users to gain new insights into the test results; these include developments to allow increased testing concurrency, increased scale in the number of metrics recorded per test job (up to hundreds), and increased scale in the historical job information (up to many millions of jobs per VO).
id cern-1457967
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2012
record_format invenio
spelling cern-1457967; 2022-08-17T13:32:57Z; http://cds.cern.ch/record/1457967; eng; CERN-IT-Note-2012-009; oai:cds.cern.ch:1457967; 2012-05-16
title Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud
topic Computing and Computers
url http://cds.cern.ch/record/1457967