
Workload modelling for data-intensive systems


Bibliographic Details
Main author: Lassnig, Mario
Language: eng
Published: 2016
Online access: http://cds.cern.ch/record/2235088
Description
Summary: This thesis presents a comprehensive study built upon the requirements of a global data-intensive system, built for the ATLAS Experiment at CERN's Large Hadron Collider. First, a scalable method is described to capture distributed data management operations in a non-intrusive way. These operations are collected into a globally synchronised sequence of events, the workload. A comparative analysis of this new data-intensive workload against existing computational workloads is conducted, leading to the discovery of the importance of descriptive attributes in the operations. Existing computational workload models consider only the arrival rates of operations; in data-intensive systems, however, the correlations between attributes play a central role. Furthermore, the detrimental effect of rapid correlated arrivals, so-called bursts, is assessed. A model is proposed that can learn burst behaviour from a captured workload and, in turn, forecast potential future bursts. To help with the creation of a full representative workload model, a similarity measure is proposed that assesses the internal structure of the workload in a two-step method: the time-dependent attribute is decomposed via wavelet transformation, and descriptive attributes are learnt via association rule mining. Finally, an analytical workload model is proposed that supports the inherent features of data-intensive systems without a learning step. That way, potential future systems in development can use a workload that is representative of data-intensive systems even though no particular historical data is available.