Job prioritization in LHCb
Main author: | |
---|---|
Language: | eng |
Published: | 2007 |
Subjects: | |
Online access: | http://cds.cern.ch/record/1120923 |
Summary: | LHCb is one of the four high-energy experiments that will run in the near future at the Large Hadron Collider (LHC) at CERN. LHCb will try to answer some fundamental questions about the asymmetry between matter and anti-matter. The experiment is expected to produce about 2 PB of data per year. These data will be distributed to several laboratories all over Europe and then analyzed by the physics community. To achieve this target, LHCb makes full use of the Grid to reprocess, replicate and analyze data. Access to the Grid happens through LHCb's own distributed production and analysis system, DIRAC (Distributed Infrastructure with Remote Agent Control). DIRAC implements the ‘pull’ job scheduling paradigm, where all jobs are stored in central task queues and then pulled by generic grid jobs called Pilot Agents. The whole LHCb community (about 600 people) is divided into sets of physicists, developers, and production and software managers, who have different needs for their jobs on the Grid. While a Monte Carlo simulation job needs several days of intensive CPU time, analysis jobs mainly need to start immediately. The current state of affairs, where all users access the Grid through a single entry point, does not prevent certain sub-communities from running most of the jobs and thereby monopolizing the use of Grid resources. The way to avoid this is to implement a system that ensures job priority and a fair share of the resources among all community users. There are two possible approaches: a site-wise approach, where the virtual organisation (VO) just takes care of filling up its queues and leaves the site-specific software to redistribute the jobs according to prior negotiations; and a VO-wise approach, best tailored to the LHCb computing model, where the site just allocates the quota of resources due to the VO and the VO decides how to share it across its user sub-communities. A rough priority algorithm based on the VO-wise approach has already been implemented.
The introduction of a ‘Priority’ flag in the job specification, together with some changes in the resource-job matching mechanism, has already been shown to give the right precedence to short analysis jobs or Reconstruction jobs over cumbersome Monte Carlo jobs. Our Priority algorithm must be considered a work in progress. Accounting information based on user, job length, and community CPU consumption will also be considered. The job priority mechanism needs to be extensively tested. An ageing system will also be introduced to prevent jobs from staying too long in the central queues before being picked up by the first suitable resource available. The mechanism relies on the assumption that DIRAC is the only access to the Grid, but it does not prevent users from bypassing DIRAC and accessing the Grid by other means. A tool to enforce VO policy at site level is therefore highly desirable. |
---|
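The pull paradigm with a priority flag and ageing, as described in the summary, can be illustrated with a minimal Python sketch. It is not DIRAC code: the names `Job`, `TaskQueue`, `effective_priority` and the constant `AGEING_RATE` are hypothetical, and the linear ageing rule is an assumption made only for illustration, since the record does not say how ageing would be computed.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical constant: priority points gained per hour spent waiting.
AGEING_RATE = 0.5


@dataclass
class Job:
    job_id: str
    job_type: str        # e.g. "analysis", "reconstruction", "montecarlo"
    priority: float      # stands in for the 'Priority' flag of the job
    cpu_hours: float     # CPU time the job requests
    submitted_at: float  # submission time, in hours


def effective_priority(job: Job, now: float) -> float:
    """Base priority plus an assumed linear bonus for time spent waiting."""
    return job.priority + AGEING_RATE * (now - job.submitted_at)


class TaskQueue:
    """Central queue from which pilot agents pull work (illustrative only)."""

    def __init__(self) -> None:
        self._waiting: List[Job] = []

    def submit(self, job: Job) -> None:
        self._waiting.append(job)

    def pull(self, available_cpu_hours: float, now: float) -> Optional[Job]:
        """Return the highest-priority job that fits the resource, or None."""
        candidates = [j for j in self._waiting
                      if j.cpu_hours <= available_cpu_hours]
        if not candidates:
            return None
        best = max(candidates, key=lambda j: effective_priority(j, now))
        self._waiting.remove(best)
        return best


if __name__ == "__main__":
    queue = TaskQueue()
    queue.submit(Job("mc-1", "montecarlo", 1.0, 72.0, submitted_at=0.0))
    queue.submit(Job("ana-1", "analysis", 5.0, 1.0, submitted_at=1.0))
    # A pilot with 100 CPU hours available asks for work at t = 1 h.
    job = queue.pull(available_cpu_hours=100.0, now=1.0)
    print(job.job_id if job else "no matching job")
```

In this sketch a freshly submitted analysis job overtakes an older Monte Carlo job, while the ageing bonus lets a long-waiting Monte Carlo job eventually win the match. A fair-share term based on accounting data would need further inputs that the record does not specify.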