PanDA Pilot Submission using Condor-G: Experience and Improvements

PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical f...

Bibliographic Details
Main authors: Zhao, X; Hover, J; Wlodek, T; Wenaus, T; Frey, J; Tannenbaum, T; Livny, M
Language: eng
Published: 2011
Subjects: Detectors and Experimental Techniques
Online access: http://cds.cern.ch/record/1322128
_version_ 1780921562736623616
author Zhao, X
Hover, J
Wlodek, T
Wenaus, T
Frey, J
Tannenbaum, T
Livny, M
author_facet Zhao, X
Hover, J
Wlodek, T
Wenaus, T
Frey, J
Tannenbaum, T
Livny, M
author_sort Zhao, X
collection CERN
description PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical for PanDA operation. The ATLAS Computing Facility (ACF) at BNL, as the ATLAS Tier1 center in the US, operates the pilot submission systems for the US. This is done using the PanDA “AutoPilot” scheduler component which submits pilot jobs via Condor-G, a grid job scheduling system developed at the University of Wisconsin-Madison. In this paper, we discuss the operation and performance of the Condor-G pilot submission at BNL, with emphasis on the challenges and issues encountered in the real grid production environment. With the close collaboration of Condor and PanDA teams, the scalability and stability of the overall system has been greatly improved over the last year. We review improvements made to Condor-G resulting from this collaboration, including isolation of site-based issues by running a separate Gridmanager for each remote site, introduction of the 'Nonessential' job attribute to allow Condor to optimize its behavior for the specific character of pilot jobs, better understanding and handling of the Gridmonitor process, as well as better scheduling in the PanDA pilot scheduler component. We will also cover the monitoring of the health of the system.
id cern-1322128
institution European Organization for Nuclear Research
language eng
publishDate 2011
record_format invenio
spelling cern-1322128 2019-09-30T06:29:59Z http://cds.cern.ch/record/1322128 eng Zhao, X; Hover, J; Wlodek, T; Wenaus, T; Frey, J; Tannenbaum, T; Livny, M. PanDA Pilot Submission using Condor-G: Experience and Improvements. Detectors and Experimental Techniques. PanDA (Production and Distributed Analysis) is the workload management system of the ATLAS experiment, used to run managed production and user analysis jobs on the grid. As a late-binding, pilot-based system, the maintenance of a smooth and steady stream of pilot jobs to all grid sites is critical for PanDA operation. The ATLAS Computing Facility (ACF) at BNL, as the ATLAS Tier1 center in the US, operates the pilot submission systems for the US. This is done using the PanDA “AutoPilot” scheduler component which submits pilot jobs via Condor-G, a grid job scheduling system developed at the University of Wisconsin-Madison. In this paper, we discuss the operation and performance of the Condor-G pilot submission at BNL, with emphasis on the challenges and issues encountered in the real grid production environment. With the close collaboration of Condor and PanDA teams, the scalability and stability of the overall system has been greatly improved over the last year. We review improvements made to Condor-G resulting from this collaboration, including isolation of site-based issues by running a separate Gridmanager for each remote site, introduction of the 'Nonessential' job attribute to allow Condor to optimize its behavior for the specific character of pilot jobs, better understanding and handling of the Gridmonitor process, as well as better scheduling in the PanDA pilot scheduler component. We will also cover the monitoring of the health of the system. ATL-SOFT-PROC-2011-015 oai:cds.cern.ch:1322128 2011-01-13
spellingShingle Detectors and Experimental Techniques
Zhao, X
Hover, J
Wlodek, T
Wenaus, T
Frey, J
Tannenbaum, T
Livny, M
PanDA Pilot Submission using Condor-G: Experience and Improvements
title PanDA Pilot Submission using Condor-G: Experience and Improvements
title_full PanDA Pilot Submission using Condor-G: Experience and Improvements
title_fullStr PanDA Pilot Submission using Condor-G: Experience and Improvements
title_full_unstemmed PanDA Pilot Submission using Condor-G: Experience and Improvements
title_short PanDA Pilot Submission using Condor-G: Experience and Improvements
title_sort panda pilot submission using condor-g: experience and improvements
topic Detectors and Experimental Techniques
url http://cds.cern.ch/record/1322128
work_keys_str_mv AT zhaox pandapilotsubmissionusingcondorgexperienceandimprovements
AT hoverj pandapilotsubmissionusingcondorgexperienceandimprovements
AT wlodekt pandapilotsubmissionusingcondorgexperienceandimprovements
AT wenaust pandapilotsubmissionusingcondorgexperienceandimprovements
AT freyj pandapilotsubmissionusingcondorgexperienceandimprovements
AT tannenbaumt pandapilotsubmissionusingcondorgexperienceandimprovements
AT livnym pandapilotsubmissionusingcondorgexperienceandimprovements