Understanding and Mastering Dynamics in Computing Grids: Processing Moldable Tasks with User-Level Overlay

Bibliographic Details
Main author: Moscicki, Jakub Tomasz
Language: eng
Published: 2013
Subjects: Computing and Computers
Online access: http://cds.cern.ch/record/1525936
author Moscicki, Jakub Tomasz
collection CERN
description Scientific communities are using a growing number of distributed systems, from local batch systems, community-specific services and supercomputers to general-purpose, global grid infrastructures. Increasing the research capabilities for science is the raison d'être of such infrastructures, which provide access to diversified computational, storage and data resources at large scales. Grids are rather chaotic, highly heterogeneous, decentralized systems where unpredictable workloads, component failures and variability of execution environments are commonplace. Understanding and mastering the heterogeneity and dynamics of such distributed systems is prohibitive for end users if they are not supported by appropriate methods and tools. The time cost to learn and use the interfaces and idiosyncrasies of different distributed environments is another challenge. Obtaining more reliable application execution times and boosting parallel speedup are important to increase the research capabilities of scientific communities. Late binding is one of the techniques to achieve these goals because the majority of jobs which are in production in grids and supercomputers are moldable. Moldable jobs may use a variable number of resources and be more flexibly partitioned than classical, rigid parallel jobs. Examples of moldable job applications include Monte Carlo simulations, parameter sweeps, directed acyclic graphs and workflows, data-parallel analysis algorithms and many more.
We analyze spatial and temporal dynamics and study the performance variations in large, loosely coupled distributed systems such as the EGEE Grid, the largest Grid infrastructure to date. We develop a mathematical description of task processing in the Grid, where system parameters are taken as random variables with empirical distributions. We analyze Quality of Service indicators such as the variance of the makespan to qualitatively compare late-binding and early-binding task processing models. Using a continuous approximation, we analytically demonstrate that properties of the late-binding model make it possible to reduce the makespan distribution according to fundamental laws of statistics. To analyze the discrete cases and more complex parameters, including the communication overheads, we use Monte Carlo simulation. We identify that under certain conditions late binding makes it possible to achieve speedups which are often greater than an order of magnitude compared to early binding.
We describe the principles guiding the development of a lightweight User-level Overlay which exploits late binding to achieve an improved Quality of Service in unreliable and unpredictable distributed environments. Our strategy is based on loosely coupled, user-space tools, where the Diane scheduler manages task allocation in a pool of worker nodes which is asynchronously created and managed by the Ganga interface. This approach makes it easy (1) to create resource selection mechanisms such as the heuristic-based worker agent factory, and (2) to plug in adaptive workload-balancing algorithms for task scheduling. Other key features include an ability to interface to a wide range of distributed systems; an ability to extend and customize the system with application-specific scheduling and processing methods; and ease of use and a uniform interface to heterogeneous job management systems.
Using real-life applications in the EGEE Grid, local batch systems and dedicated clusters, we demonstrate new and improved capabilities which are provided by the Ganga/Diane User-level Overlay on top of a generic middleware stack. These capabilities include efficient short-deadline computing, increased dependability, autonomous large-scale operations, efficient parameter sweeps, man-in-the-loop scenarios, automated DAGs/workflows and semi-interactivity. We present two case studies of capacity and capability computing with the User-level Overlay. We show how a large number of tasks with a short deadline was coordinated on the Grid to improve the dependability of locally available resources for the International Telecommunication Union Regional Radio Conference 2006. Then we describe how task prioritization and resource selection were implemented for the Lattice QCD simulations for the QCD thermodynamics studies in the context of heavy-ion collision experiments (LHC, RHIC).
This work is a contribution to the debate on whether Quality of Service in grids may be efficiently implemented at the application level. We demonstrated that it is indeed possible (1) by giving a theoretical explanation of the effects of late binding on key task processing metrics, and (2) by showing examples of applications which successfully applied the User-level Overlay.
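The contrast between early and late binding described in the abstract can be illustrated with a toy Monte Carlo experiment. The sketch below is not the model developed in the thesis: the function names, the uniform worker-speed distribution and the exponential task sizes are all assumptions chosen only for illustration, and communication overheads and failures are ignored. Early binding is approximated by fixing a round-robin split of tasks before execution; late binding by handing each task to whichever worker becomes free first.

```python
import heapq
import random
import statistics


def early_binding_makespan(task_sizes, speeds):
    """Makespan when tasks are split round-robin before execution starts."""
    finish = [0.0] * len(speeds)
    for i, size in enumerate(task_sizes):
        w = i % len(speeds)              # assignment fixed in advance
        finish[w] += size / speeds[w]
    return max(finish)


def late_binding_makespan(task_sizes, speeds):
    """Makespan when each task goes to the worker that becomes free first."""
    # Heap of (time at which the worker becomes free, worker index).
    free_at = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    for size in task_sizes:
        t, w = heapq.heappop(free_at)    # earliest available worker pulls a task
        done = t + size / speeds[w]
        makespan = max(makespan, done)
        heapq.heappush(free_at, (done, w))
    return makespan


def simulate(n_trials=200, n_tasks=400, n_workers=20, seed=1):
    """Repeat both policies on the same random inputs (assumed distributions)."""
    rng = random.Random(seed)
    early, late = [], []
    for _ in range(n_trials):
        speeds = [rng.uniform(0.1, 1.0) for _ in range(n_workers)]  # assumption
        tasks = [rng.expovariate(1.0) for _ in range(n_tasks)]      # assumption
        early.append(early_binding_makespan(tasks, speeds))
        late.append(late_binding_makespan(tasks, speeds))
    return early, late


if __name__ == "__main__":
    early, late = simulate()
    print("early binding: mean", statistics.mean(early),
          "variance", statistics.variance(early))
    print("late binding:  mean", statistics.mean(late),
          "variance", statistics.variance(late))
```

In this simplified setting the slowest worker dominates the early-binding makespan, whereas under late binding slow workers simply process fewer tasks. How large the gap is depends entirely on the assumed distributions.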
id cern-1525936
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2013
record_format invenio
spelling cern-1525936 2019-09-30T06:29:59Z http://cds.cern.ch/record/1525936 eng Moscicki, Jakub Tomasz Computing and Computers CERN-THESIS-2010-289 oai:cds.cern.ch:1525936 2013-03-12T10:13:20Z
title Understanding and Mastering Dynamics in Computing Grids: Processing Moldable Tasks with User-Level Overlay
topic Computing and Computers
url http://cds.cern.ch/record/1525936