PanDA Pilot Submission using Condor-G: Experience and Improvements
PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical f...
Main authors: | Zhao, X; Hover, J; Wlodek, T; Wenaus, T; Frey, J; Tannenbaum, T; Livny, M |
---|---|
Language: | eng |
Published: | 2011 |
Subjects: | Detectors and Experimental Techniques |
Online access: | http://cds.cern.ch/record/1322128 |
_version_ | 1780921562736623616 |
---|---|
author | Zhao, X; Hover, J; Wlodek, T; Wenaus, T; Frey, J; Tannenbaum, T; Livny, M |
author_facet | Zhao, X; Hover, J; Wlodek, T; Wenaus, T; Frey, J; Tannenbaum, T; Livny, M |
author_sort | Zhao, X |
collection | CERN |
description | PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical for PanDA operation. The ATLAS Computing Facility (ACF) at BNL, as the ATLAS Tier1 center in the US, operates the pilot submission systems for the US. This is done using the PanDA “AutoPilot” scheduler component, which submits pilot jobs via Condor-G, a grid job scheduling system developed at the University of Wisconsin-Madison. In this paper, we discuss the operation and performance of the Condor-G pilot submission at BNL, with emphasis on the challenges and issues encountered in the real grid production environment. With the close collaboration of the Condor and PanDA teams, the scalability and stability of the overall system have been greatly improved over the last year. We review improvements made to Condor-G resulting from this collaboration, including isolation of site-based issues by running a separate Gridmanager for each remote site, introduction of the 'Nonessential' job attribute to allow Condor to optimize its behavior for the specific character of pilot jobs, better understanding and handling of the Gridmonitor process, as well as better scheduling in the PanDA pilot scheduler component. We will also cover the monitoring of the health of the system. |
id | cern-1322128 |
institution | European Organization for Nuclear Research |
language | eng |
publishDate | 2011 |
record_format | invenio |
spelling | cern-1322128; 2019-09-30T06:29:59Z; http://cds.cern.ch/record/1322128; eng; Zhao, X; Hover, J; Wlodek, T; Wenaus, T; Frey, J; Tannenbaum, T; Livny, M; PanDA Pilot Submission using Condor-G: Experience and Improvements; Detectors and Experimental Techniques; PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical for PanDA operation. The ATLAS Computing Facility (ACF) at BNL, as the ATLAS Tier1 center in the US, operates the pilot submission systems for the US. This is done using the PanDA “AutoPilot” scheduler component which submits pilot jobs via Condor-G, a grid job scheduling system developed at the University of Wisconsin-Madison. In this paper, we discuss the operation and performance of the Condor-G pilot submission at BNL, with emphasis on the challenges and issues encountered in the real grid production environment. With the close collaboration of Condor and PanDA teams, the scalability and stability of the overall system has been greatly improved over the last year. We review improvements made to Condor-G resulting from this collaboration, including isolation of site-based issues by running a separate Gridmanager for each remote site, introduction of the 'Nonessential' job attribute to allow Condor to optimize its behavior for the specific character of pilot jobs, better understanding and handling of the Gridmonitor process, as well as better scheduling in the PanDA pilot scheduler component. We will also cover the monitoring of the health of the system.; ATL-SOFT-PROC-2011-015; oai:cds.cern.ch:1322128; 2011-01-13 |
spellingShingle | Detectors and Experimental Techniques; Zhao, X; Hover, J; Wlodek, T; Wenaus, T; Frey, J; Tannenbaum, T; Livny, M; PanDA Pilot Submission using Condor-G: Experience and Improvements |
title | PanDA Pilot Submission using Condor-G: Experience and Improvements |
title_full | PanDA Pilot Submission using Condor-G: Experience and Improvements |
title_fullStr | PanDA Pilot Submission using Condor-G: Experience and Improvements |
title_full_unstemmed | PanDA Pilot Submission using Condor-G: Experience and Improvements |
title_short | PanDA Pilot Submission using Condor-G: Experience and Improvements |
title_sort | panda pilot submission using condor-g: experience and improvements |
topic | Detectors and Experimental Techniques |
url | http://cds.cern.ch/record/1322128 |
work_keys_str_mv | AT zhaox pandapilotsubmissionusingcondorgexperienceandimprovements AT hoverj pandapilotsubmissionusingcondorgexperienceandimprovements AT wlodekt pandapilotsubmissionusingcondorgexperienceandimprovements AT wenaust pandapilotsubmissionusingcondorgexperienceandimprovements AT freyj pandapilotsubmissionusingcondorgexperienceandimprovements AT tannenbaumt pandapilotsubmissionusingcondorgexperienceandimprovements AT livnym pandapilotsubmissionusingcondorgexperienceandimprovements |