
Non-intrusive Quality Analysis of Monitoring Data


Bibliographic Details
Main authors: Brightwell, M, Ailamaki, Anastasia, Suwalska, Anna
Language: eng
Published: 2010
Subjects:
Online access: https://dx.doi.org/10.1007/978-3-642-13818-8_20
http://cds.cern.ch/record/1359261
Description
Summary: Any large-scale operational system running over a variety of devices requires a monitoring mechanism to assess the health of the overall system. The Technical Infrastructure Monitoring System (TIM) at CERN is one such system; it monitors a wide variety of devices and their properties, such as electricity supplies, device temperatures, and liquid flows. Without adequate quality assurance, the data collected from such devices leads to false positives and false negatives, reducing the effectiveness of the monitoring system. The quality must, however, be measured in a non-intrusive way, so that the critical path of the data flow is not affected by the quality computation. The quality computation should also scale to large volumes of incoming data. To address these challenges, we develop a new statistical module, which monitors the data collected by TIM and reports its quality to the operators. The statistical module uses Oracle RDBMS as the underlying store and builds hierarchical summaries on the basic events to scale to the volume of data. It has a built-in fault-tolerance capability to recover from multiple computation failures. In this paper, we describe the design of the statistical module and its usefulness for all parties involved with TIM: the system administrators, the operators using the system to monitor the devices, and the engineers responsible for attaching the devices to the system. We present concrete examples of how the software module has helped with the monitoring, configuration, and design of TIM since its introduction last year.