Grid reliability

Bibliographic Details
Main authors:	Saiz, P, Gaidioz, B, Rocha, R, Andreeva, J
Language:	eng
Published:	2007
Subjects:	Computing and Computers
Online access:	http://cds.cern.ch/record/1120924

Description
Summary:	We offer a system to track the efficiency of different components of the GRID. It can be used to study the performance of both the WMS and the data transfers. At the moment, we have set up different parts of the system for ALICE, ATLAS, CMS and LHCb. None of the components that we have developed are VO specific, so it would be very easy to deploy them for any other VO. Our main goal is to improve the reliability of the GRID. The main idea is to discover problems as soon as possible after they happen, and to inform those responsible. Since we study the jobs and transfers issued by real users, we see the same problems that users see. In fact, we see even more problems than the end user does, since we also follow up the errors that GRID components can overcome by themselves (for instance, by resubmitting a failed job to a different site). This kind of information is very useful to site and VO administrators. They can find out the efficiency of their sites and, in case of failures, the problems that they have to solve. The reports that we provide are also of interest to the COD, since the errors might not be VO specific. The whole system is based on studying the actions that users perform. Therefore, the first and most important dependency is on monitoring systems. We handle this by interfacing with the DASHBOARD, which hides the differences between the heterogeneous sources of data (such as RGMA, ICXML or MonALISA). Another service that is very important for the effectiveness of Grid reliability is GGUS, used for the submission and tracking of tickets. This has already been tested with a manual procedure. Since the result was very encouraging, we are working on ways of automating this interaction. The main problem that we have found so far is the lack of communication between the new gLite RB and RGMA. Jobs that went through these resource brokers do not publish their status, which makes our task impossible for those jobs. Another possible problem is the confidentiality of the data. To address this, we are anonymising the jobs and transfers, since we are only interested in the different states that a job or transfer goes through.
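
The distinction the summary draws between what the end user sees and what the monitoring sees can be illustrated with a small sketch. It is not the authors' implementation: the record layout, the field names (job_id, user, site, status), the site names and the helper functions are all hypothetical, and hashing the job identifier is only one possible way to anonymise a record. The sketch counts every attempt, including those later recovered by resubmission, and compares that with the outcome of each job's final attempt.

```python
import hashlib
from collections import defaultdict

# Hypothetical job-attempt records, listed in the order the attempts happened;
# the field names and values are assumptions made for illustration only.
ATTEMPTS = [
    {"job_id": "j1", "user": "user_a", "site": "SiteA", "status": "failed"},
    {"job_id": "j1", "user": "user_a", "site": "SiteB", "status": "done"},
    {"job_id": "j2", "user": "user_b", "site": "SiteA", "status": "done"},
    {"job_id": "j3", "user": "user_b", "site": "SiteB", "status": "failed"},
]


def anonymise(attempt):
    """Drop the user identity; keep an opaque job key, the site and the status."""
    digest = hashlib.sha256(attempt["job_id"].encode("utf-8")).hexdigest()[:12]
    return {"job": digest, "site": attempt["site"], "status": attempt["status"]}


def site_efficiency(attempts):
    """Fraction of attempts per site that succeeded, counting every attempt."""
    done = defaultdict(int)
    total = defaultdict(int)
    for a in attempts:
        total[a["site"]] += 1
        if a["status"] == "done":
            done[a["site"]] += 1
    return {site: done[site] / total[site] for site in total}


def user_visible_efficiency(attempts):
    """Fraction of jobs whose last recorded attempt succeeded."""
    last_status = {}
    for a in attempts:
        last_status[a["job"]] = a["status"]  # later attempts overwrite earlier ones
    succeeded = sum(1 for s in last_status.values() if s == "done")
    return succeeded / len(last_status)


records = [anonymise(a) for a in ATTEMPTS]
per_site = site_efficiency(records)
overall = user_visible_efficiency(records)
```

In this toy data set the failed first attempt of job j1 lowers the efficiency of SiteA even though the user eventually got a result, which is the kind of hidden failure the summary says site administrators want to see.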
