
Understanding and Mastering Dynamics in Computing Grids: Processing Moldable Tasks with User-Level Overlay


Bibliographic Details
Main author: Moscicki, Jakub Tomasz
Language: eng
Published: 2013
Subjects:
Online access: http://cds.cern.ch/record/1525936
Description
Summary: Scientific communities are using a growing number of distributed systems, from local batch systems, community-specific services and supercomputers to general-purpose, global grid infrastructures. Increasing the research capabilities for science is the raison d'être of such infrastructures, which provide access to diversified computational, storage and data resources at large scales. Grids are rather chaotic, highly heterogeneous, decentralized systems where unpredictable workloads, component failures and variability of execution environments are commonplace. Understanding and mastering the heterogeneity and dynamics of such distributed systems is prohibitive for end users if they are not supported by appropriate methods and tools. The time cost to learn and use the interfaces and idiosyncrasies of different distributed environments is another challenge. Obtaining more reliable application execution times and boosting parallel speedup are important to increase the research capabilities of scientific communities. Late binding is one of the techniques to achieve these goals because the majority of jobs which are in production in grids and supercomputers are moldable. Moldable jobs may use a variable number of resources and be more flexibly partitioned than classical, rigid parallel jobs. Moldable job application examples include Monte Carlo simulations, parameter sweeps, directed acyclic graphs and workflows, data-parallel analysis algorithms and many more. We analyze spatial and temporal dynamics and study the performance variations in large, loosely coupled distributed systems such as the EGEE Grid, the largest Grid infrastructure to date. We develop a mathematical description of task processing in the Grid, where system parameters are taken as random variables with empirical distributions. We analyze the Quality of Service indicators such as variance of makespan to qualitatively compare late- and early-binding task processing models.
Using a continuous approximation we analytically demonstrate that properties of the late-binding model allow the makespan distribution to be reduced according to fundamental laws of statistics. To analyze the discrete cases and more complex parameters, including the communication overheads, we use Monte Carlo simulation. We identify that under certain conditions late binding can achieve speedups which often exceed an order of magnitude compared to early binding. We describe the principles guiding the development of a lightweight User-level Overlay which exploits late binding to achieve an improved Quality of Service in unreliable and unpredictable distributed environments. Our strategy is based on loosely coupled, user-space tools, where the Diane scheduler manages task allocation in a pool of worker nodes which is asynchronously created and managed by the Ganga interface. This approach makes it easy (1) to create resource selection mechanisms such as the heuristic-based worker agent factory, and (2) to plug in adaptive workload-balancing algorithms for task scheduling. Other key features include an ability to interface to a wide range of distributed systems; an ability to extend and customize the system with application-specific scheduling and processing methods; ease of use and a uniform interface to heterogeneous job management systems. Using real-life applications in the EGEE Grid, local batch systems and dedicated clusters we demonstrate new and improved capabilities which are provided by the Ganga/Diane User-level Overlay above the generic middleware stack. These capabilities include efficient short-deadline computing, increased dependability, autonomous large-scale operations, efficient parameter sweeps, man-in-the-loop scenarios, automated DAGs/workflows and semi-interactivity. We present two case studies of capacity and capability computing with the User-level Overlay.
We show how a large number of tasks with a short deadline were coordinated on the Grid to improve dependability of locally available resources for the International Telecommunication Union Regional Radio Conference 2006. Then we describe how task prioritization and resource selection were implemented for the Lattice QCD simulations for the QCD thermodynamics studies in the context of heavy-ion collision experiments (LHC, RHIC). This work is a contribution to the debate on whether Quality of Service in grids may be efficiently implemented at the application level. We demonstrated that it is indeed possible (1) by giving a theoretical explanation of the effects of late binding on key task processing metrics, and (2) by showing examples of applications which successfully applied the User-level Overlay.
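
The comparison of early and late binding described in the summary can be illustrated with a small Monte Carlo sketch in Python. This is not the model developed in the thesis: the lognormal distribution of worker speeds, the round-robin form of early binding, the identical task costs, and all function and parameter names are assumptions chosen only to make the idea concrete. Early binding fixes the task-to-worker assignment before execution, while late binding lets whichever worker becomes free first pull the next task.

```python
import heapq
import random
import statistics


def early_binding_makespan(task_costs, worker_speeds):
    """Early binding (assumed round-robin): tasks are assigned before execution."""
    finish = [0.0] * len(worker_speeds)
    for i, cost in enumerate(task_costs):
        w = i % len(worker_speeds)
        finish[w] += cost / worker_speeds[w]
    # The makespan is the finish time of the last worker to complete.
    return max(finish)


def late_binding_makespan(task_costs, worker_speeds):
    """Late binding (pull model): the worker that is free first takes the next task."""
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    for cost in task_costs:
        t, w = heapq.heappop(free_at)
        done = t + cost / worker_speeds[w]
        makespan = max(makespan, done)
        heapq.heappush(free_at, (done, w))
    return makespan


def simulate(n_trials=200, n_tasks=200, n_workers=10, seed=0):
    """Return lists of early- and late-binding makespans over random worker pools."""
    rng = random.Random(seed)
    early, late = [], []
    for _ in range(n_trials):
        costs = [1.0] * n_tasks  # identical tasks (assumption)
        # Worker speeds drawn from a lognormal distribution (assumption).
        speeds = [rng.lognormvariate(0.0, 1.0) for _ in range(n_workers)]
        early.append(early_binding_makespan(costs, speeds))
        late.append(late_binding_makespan(costs, speeds))
    return early, late


if __name__ == "__main__":
    early, late = simulate()
    print("early binding: mean", statistics.mean(early),
          "variance", statistics.pvariance(early))
    print("late binding:  mean", statistics.mean(late),
          "variance", statistics.pvariance(late))
```

Running the script prints the mean and variance of the makespan for both models. In this toy setting a slow worker delays the whole run under early binding, whereas under late binding it simply processes fewer tasks.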