Cargando…

New solutions for large scale functional tests in the WLCG infrastructure with SAM/Nagios: the experiments experience

Since several years the LHC experiments rely on the WLCG Service Availability Monitoring framework (SAM) to run functional tests on their distributed computing systems. The SAM tests have become an essential tool to measure the reliability of the Grid infrastructure and to ensure reliable computing...

Descripción completa

Detalles Bibliográficos
Autores principales: Andreeva, J, Dhara, P, Di Girolamo, A, Kakkar, A, Litmaath, M, Magini, N, Negri, G, Ramachandran, S, Roiser, S, Saiz, P, Saiz Santos, M D, Sarkar, B, Schovancova, J, Sciabà, A, Wakankar, A
Lenguaje:eng
Publicado: 2012
Materias:
Acceso en línea:http://cds.cern.ch/record/1458013
Descripción
Sumario:Since several years the LHC experiments rely on the WLCG Service Availability Monitoring framework (SAM) to run functional tests on their distributed computing systems. The SAM tests have become an essential tool to measure the reliability of the Grid infrastructure and to ensure reliable computing operations, both for the sites and the experiments. Recently the old SAM framework was replaced with a completely new system based on Nagios and ActiveMQ to better support the transition to EGI and to its more distributed infrastructure support model and to implement several scalability and functionality enhancements. This required all LHC experiments and the WLCG support teams to migrate their tests, to acquire expertise on the new system, to validate the new availability and reliability computations and to adopt new visualisation tools. In this contribution we describe in detail the current state of the art of functional testing in WLCG: how the experiments use the new SAM/Nagios framework, the advanced functionality made available by the new framework and the future developments that are foreseen, with a strong focus on the improvements in terms of stability and flexibility brought by the new system.