
PanDA Pilot Submission using Condor-G: Experience and Improvements


Bibliographic Details
Main authors: Zhao, Xin; Hover, John; Wlodek, Tom; Wenaus, Torre; Frey, Jaime; Tannenbaum, Todd; Livny, Miron
Language: eng
Published: 2010
Online access: http://cds.cern.ch/record/1299834
Description
Summary: PanDA is the workload management system of the ATLAS experiment, used to run production and user analysis jobs on the grid. Because PanDA is a late-binding, pilot-based system, maintaining a smooth and steady stream of pilot jobs to all grid sites is critical for its operation. The ATLAS Computing Facility (ACF) at BNL, as the ATLAS Tier 1 center in the US, operates the pilot submission systems for the US. This is done using the PanDA "AutoPilot" scheduler component, which submits pilot jobs via Condor-G, a grid job scheduling system developed at the University of Wisconsin-Madison. In this talk, we discuss the operation and performance of the Condor-G pilot submission at BNL, with emphasis on the challenges and issues encountered in the real grid production environment. Through close collaboration between the Condor and PanDA teams, the scalability and stability of the overall system have been greatly improved over the last year. We review improvements made to Condor-G resulting from this collaboration, including isolation of site-based issues by running a separate Grid Manager for each remote site, introduction of the 'Nonessential' job attribute to allow Condor to optimize its behavior for the specific characteristics of pilot jobs, better understanding and handling of the Grid Monitor process, better scheduling in the PanDA pilot scheduler component, as well as bug fixes in Condor itself and the underlying Globus libraries. We will also cover the monitoring of the health of the system, followed by plans for future improvements.
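
As a rough illustration of the 'Nonessential' job attribute mentioned in the summary, the sketch below builds a Condor-G submit description that marks pilot jobs as nonessential. It is not taken from the AutoPilot code. The function name, the gatekeeper host, the executable name, and the output file names are all hypothetical placeholders. Only the attribute name comes from the summary.

```python
def pilot_submit_description(grid_resource, executable="pilot_wrapper.sh", count=1):
    """Return the text of a Condor-G submit description for `count` pilot jobs.

    All argument values are illustrative assumptions, not values used at BNL.
    """
    lines = [
        # Grid universe: the job is forwarded to a remote grid resource.
        "universe = grid",
        f"grid_resource = {grid_resource}",
        f"executable = {executable}",
        # Custom job attribute named in the summary: it tells Condor that
        # losing this job is acceptable, which suits interchangeable pilots.
        "+Nonessential = True",
        "output = pilot.$(Cluster).$(Process).out",
        "error = pilot.$(Cluster).$(Process).err",
        "log = pilot.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical remote site; replace with a real gatekeeper contact string.
    print(pilot_submit_description(
        "gt2 gatekeeper.example.org/jobmanager-condor", count=5))
```

The printed text could be saved to a file and passed to `condor_submit`. Whether a given site accepts the `gt2` resource type shown here depends on its configuration.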