Monitoring, accounting and automated decision support for the ALICE experiment based on the MonALISA framework
Main authors: | , , , , , , , |
---|---|
Language: | eng |
Published: | 2007 |
Subjects: | |
Online access: | http://cds.cern.ch/record/1120925 |
Summary: | We are developing a general-purpose monitoring system for the ALICE experiment, based on the MonALISA framework. MonALISA (Monitoring Agents using a Large Integrated Services Architecture) is a fully distributed system with no single point of failure that collects and stores monitoring information and presents it as meaningful perspectives and synthetic views of the status and trends of the entire system. Agents can also use this information to make automated operational decisions. Monitoring information is gathered locally from all the components running at each site. The entire flow of information is aggregated at site level by a MonALISA service and then collected and presented in various forms by a central MonALISA Repository. Based on this information, other services make operational decisions such as alerts, triggers, service restarts, and automatic submission of production jobs or transfers. The system monitors all the components: computer clusters (all major parameters of each computing node); job status and consumed resources (CPU, both in time and in SpecInt2k units, memory, disk usage); job network traffic while reading and writing files with xrootd; service availability, with details in case of failures (both AliEn and LCG services, proxy lifetimes); storage, with detailed information on the number of files, available space, and staging and migration operations; and FTD/FTS transfers. The system has been reliable and functional for more than two years and serves as the main view of the ALICE Grid. Our focus is now on using the monitoring information to develop higher-level services that can make more intelligent operational decisions. |
---|
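As a rough illustration of the information flow the summary describes (values gathered locally per node, aggregated at site level, combined into a central view, then used for automated decisions such as alerts and service restarts), the following Python sketch may help. Everything in it is hypothetical: the names `Sample`, `aggregate_by_site` and `decide`, the `cpu_load` metric, and the 0.9 alert threshold are assumptions for illustration only. This is not the MonALISA API and does not reproduce the authors' implementation.

```python
"""Hypothetical sketch of the flow: node samples -> site aggregation ->
central view -> simple automated decisions. Not the MonALISA API."""
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    site: str     # hypothetical site name
    node: str     # computing node that produced the value
    metric: str   # hypothetical metric name, e.g. "cpu_load"
    value: float


def aggregate_by_site(samples):
    """Site-level aggregation: average each metric over the nodes of a site.

    Keying the result by (site, metric) lets the per-site results be merged
    into a single central view.
    """
    grouped = {}
    for s in samples:
        grouped.setdefault((s.site, s.metric), []).append(s.value)
    return {key: mean(values) for key, values in grouped.items()}


def decide(site_view, service_up, load_threshold=0.9):
    """Toy decision rules (assumed, not from the source):
    restart a service reported as down; raise an alert when the
    site-averaged CPU load exceeds the threshold."""
    actions = []
    for site, up in sorted(service_up.items()):
        if not up:
            actions.append(("restart_service", site))
    for (site, metric), value in sorted(site_view.items()):
        if metric == "cpu_load" and value > load_threshold:
            actions.append(("alert", site))
    return actions


if __name__ == "__main__":
    samples = [
        Sample("SiteA", "n1", "cpu_load", 0.95),
        Sample("SiteA", "n2", "cpu_load", 0.97),
        Sample("SiteB", "n1", "cpu_load", 0.40),
    ]
    view = aggregate_by_site(samples)
    # prints [('restart_service', 'SiteB'), ('alert', 'SiteA')]
    print(decide(view, {"SiteA": True, "SiteB": False}))
```

Keeping aggregation and decision-making in separate functions mirrors the separation in the summary between the monitoring services that collect data and the other services that act on it; real rules would of course be richer than a single threshold.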