
Integration of Titan supercomputer at OLCF with ATLAS Production System

The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundr...


Bibliographic Details
Main authors:	Barreiro Megino, Fernando Harald; De, Kaushik; Klimentov, Alexei; Nilsson, Paul; Oleynik, Danila; Padolski, Siarhei; Panitkin, Sergey; Wenaus, Torre
Language:	eng
Published:	2017
Subjects:	Particle Physics - Experiment
Online access:	https://dx.doi.org/10.1088/1742-6596/898/9/092002 http://cds.cern.ch/record/2241966

collection	CERN
description	The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide, so additional computing and storage resources are required. ATLAS is therefore engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. This paper describes a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA Pilot framework for job submission to Titan’s batch queues and for local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan’s multi-core worker nodes. It allows standard ATLAS production jobs to run on otherwise unused resources (backfill) on Titan. The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan’s utilization efficiency. We discuss the details of the implementation, current experience with running the system, and future plans aimed at improving scalability and efficiency.
Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
id	cern-2241966
institution	European Organization for Nuclear Research
record_format	invenio
spelling	cern-2241966; 2019-10-15T15:18:50Z; doi:10.1088/1742-6596/898/9/092002; http://cds.cern.ch/record/2241966; eng; ATL-SOFT-PROC-2017-014; oai:cds.cern.ch:2241966; 2017-01-13
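
The abstract mentions lightweight MPI wrappers that run single-node workloads in parallel on Titan's worker nodes. The record does not show the authors' wrapper, so the following is only a minimal sketch of the general idea, under stated assumptions: it uses the mpi4py package (an assumption, with a single-process fallback when it is not installed), and the per-rank directory layout and the names `rank_workdir` and `run_payload` are hypothetical. Each MPI rank launches the same independent payload command in its own working directory, and rank 0 collects the exit codes.

```python
import os
import subprocess
import sys


def rank_workdir(base_dir, rank):
    """Return a per-rank working directory (hypothetical layout)."""
    return os.path.join(base_dir, f"rank_{rank:05d}")


def run_payload(base_dir, rank, command):
    """Run one independent payload in this rank's own directory.

    Output goes to payload.log inside that directory; the payload's
    exit code is returned.
    """
    workdir = rank_workdir(base_dir, rank)
    os.makedirs(workdir, exist_ok=True)
    log_path = os.path.join(workdir, "payload.log")
    with open(log_path, "w") as log:
        return subprocess.call(
            command, cwd=workdir, stdout=log, stderr=subprocess.STDOUT
        )


def main(argv):
    if len(argv) < 3:
        print("usage: wrapper.py BASE_DIR COMMAND [ARGS...]", file=sys.stderr)
        return 2
    try:
        # Assumption: mpi4py is available on the machine.
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
    except ImportError:
        # Fallback so the sketch also runs as a single process.
        comm, rank = None, 0

    exit_code = run_payload(argv[1], rank, argv[2:])

    # Collect every rank's exit code on rank 0 for a single summary.
    codes = comm.gather(exit_code, root=0) if comm is not None else [exit_code]
    if rank == 0:
        failed = sum(1 for code in codes if code != 0)
        print(f"{len(codes)} payload(s) finished, {failed} failed")
        return 1 if failed else 0
    return exit_code


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Under an MPI launcher, one copy of the script would start per rank, so a single batch submission can carry many unrelated single-node jobs. The payloads never communicate with each other; MPI serves only to start them together and to gather their exit codes.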
