Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud
Frequent validation and stress testing of the network, storage and CPU resources of a grid site is essential to achieve high performance and reliability. HammerCloud was previously introduced with the goals of enabling VO- and site-administrators to run such tests in an automated or on-demand manner...
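The abstract mentions an automatic site exclusion component with tunable policies. As a purely illustrative sketch — not HammerCloud's actual implementation, and with hypothetical names and thresholds — such a threshold-based exclusion rule might look like:

```python
# Illustrative sketch only: the real HammerCloud exclusion logic is not
# described in this record. Function name, threshold, and window size
# are hypothetical.

def should_exclude(recent_results, failure_threshold=0.5, min_tests=10):
    """Return True if a site's recent functional-test failure rate
    exceeds failure_threshold, given at least min_tests results.

    recent_results: list of booleans, True = test job succeeded.
    """
    if len(recent_results) < min_tests:
        return False  # not enough evidence to exclude the site
    failures = sum(1 for ok in recent_results if not ok)
    return failures / len(recent_results) > failure_threshold

# Example: 8 failures out of 12 recent test jobs exceeds a 50% threshold
print(should_exclude([False] * 8 + [True] * 4))  # True
```

Tuning the policy, as the abstract describes for ATLAS, would amount to adjusting parameters like the failure threshold and the minimum evidence window.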
Main authors: | Van der Ster, D; Elmsheuser, J; Medrano Llamas, R; Legger, F; Sciaba, A; Sciacca, G; Ubeda Garcia, M |
---|---|
Language: | eng |
Published: | 2012 |
Subjects: | Computing and Computers |
Online access: | http://cds.cern.ch/record/1457967 |
_version_ | 1780925147093401600 |
---|---|
author | Van der Ster, D; Elmsheuser, J; Medrano Llamas, R; Legger, F; Sciaba, A; Sciacca, G; Ubeda Garcia, M |
author_facet | Van der Ster, D; Elmsheuser, J; Medrano Llamas, R; Legger, F; Sciaba, A; Sciacca, G; Ubeda Garcia, M |
author_sort | Van der Ster, D |
collection | CERN |
description | Frequent validation and stress testing of the network, storage and CPU resources of a grid site is essential to achieve high performance and reliability. HammerCloud was previously introduced with the goals of enabling VO- and site-administrators to run such tests in an automated or on-demand manner. The ATLAS, CMS and LHCb experiments have all developed VO plugins for the service and have successfully integrated it into their grid operations infrastructures. This work will present the experience in running HammerCloud at full scale for more than 3 years and present solutions to the scalability issues faced by the service. First, we will show the particular challenges faced when integrating with CMS and LHCb offline computing, including customized dashboards to show site validation reports for the VOs and a new API to tightly integrate with the LHCbDIRAC Resource Status System. Next, a study of the automatic site exclusion component used by ATLAS will be presented along with results for tuning the exclusion policies. A study of the historical test results for ATLAS, CMS and LHCb will be presented, including comparisons between the experiments' grid availabilities and a search for site-based or temporal failure correlations. Finally, we will look to future plans that will allow users to gain new insights into the test results; these include developments to allow increased testing concurrency, increased scale in the number of metrics recorded per test job (up to hundreds), and increased scale in the historical job information (up to many millions of jobs per VO). |
id | cern-1457967 |
institution | European Organization for Nuclear Research |
language | eng |
publishDate | 2012 |
record_format | invenio |
spelling | cern-1457967 2022-08-17T13:32:57Z http://cds.cern.ch/record/1457967 eng Van der Ster, D; Elmsheuser, J; Medrano Llamas, R; Legger, F; Sciaba, A; Sciacca, G; Ubeda Garcia, M Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud Computing and Computers Frequent validation and stress testing of the network, storage and CPU resources of a grid site is essential to achieve high performance and reliability. HammerCloud was previously introduced with the goals of enabling VO- and site-administrators to run such tests in an automated or on-demand manner. The ATLAS, CMS and LHCb experiments have all developed VO plugins for the service and have successfully integrated it into their grid operations infrastructures. This work will present the experience in running HammerCloud at full scale for more than 3 years and present solutions to the scalability issues faced by the service. First, we will show the particular challenges faced when integrating with CMS and LHCb offline computing, including customized dashboards to show site validation reports for the VOs and a new API to tightly integrate with the LHCbDIRAC Resource Status System. Next, a study of the automatic site exclusion component used by ATLAS will be presented along with results for tuning the exclusion policies. A study of the historical test results for ATLAS, CMS and LHCb will be presented, including comparisons between the experiments' grid availabilities and a search for site-based or temporal failure correlations. Finally, we will look to future plans that will allow users to gain new insights into the test results; these include developments to allow increased testing concurrency, increased scale in the number of metrics recorded per test job (up to hundreds), and increased scale in the historical job information (up to many millions of jobs per VO). CERN-IT-Note-2012-009 oai:cds.cern.ch:1457967 2012-05-16 |
spellingShingle | Computing and Computers Van der Ster, D; Elmsheuser, J; Medrano Llamas, R; Legger, F; Sciaba, A; Sciacca, G; Ubeda Garcia, M Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud |
title | Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud |
title_full | Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud |
title_fullStr | Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud |
title_full_unstemmed | Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud |
title_short | Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud |
title_sort | experience in grid site testing for atlas, cms and lhcb with hammercloud |
topic | Computing and Computers |
url | http://cds.cern.ch/record/1457967 |
work_keys_str_mv | AT vandersterd experienceingridsitetestingforatlascmsandlhcbwithhammercloud AT elmsheuserj experienceingridsitetestingforatlascmsandlhcbwithhammercloud AT medranollamasr experienceingridsitetestingforatlascmsandlhcbwithhammercloud AT leggerf experienceingridsitetestingforatlascmsandlhcbwithhammercloud AT sciabaa experienceingridsitetestingforatlascmsandlhcbwithhammercloud AT sciaccag experienceingridsitetestingforatlascmsandlhcbwithhammercloud AT ubedagarcam experienceingridsitetestingforatlascmsandlhcbwithhammercloud |