
Reliability Engineering for ATLAS Petascale Data Processing on the Grid

The ATLAS detector is in its third year of continuous LHC running taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated...


Bibliographic Details
Main authors: Golubkov, D V, Minaenko, A A, Vaniachine, A V
Language: eng
Published: 2012
Subjects: Detectors and Experimental Techniques
Online access: http://cds.cern.ch/record/1462270
_version_ 1780925302972612608
author Golubkov, D V
Minaenko, A A
Vaniachine, A V
author_facet Golubkov, D V
Minaenko, A A
Vaniachine, A V
author_sort Golubkov, D V
collection CERN
description The ATLAS detector is in its third year of continuous LHC running, taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated software and calibrations to improve the quality of the reconstructed data for physics analysis. Data reprocessing involves a significant commitment of computing resources and is conducted on the Grid. The reconstruction of one petabyte of ATLAS data with one billion collision events from the LHC takes about three million core-hours. Petascale data processing on the Grid involves millions of data processing jobs. At such scales, the reprocessing must handle a continuous stream of failures. Automatic job resubmission recovers from transient failures at the cost of the CPU time used by the failed jobs. By orchestrating ATLAS data processing applications to ensure efficient usage of tens of thousands of CPU cores, reliability engineering minimizes the reprocessing duration and failure recovery costs. In the 2010 reprocessing, the cost of recovering from transient failures was 6% of the CPU time used for reconstruction. In the 2011 reprocessing, this cost was reduced to 4% of the CPU time used for reconstruction. We present a reliability engineering analysis of the ATLAS petascale data processing on the Grid.
id cern-1462270
institution European Organization for Nuclear Research
language eng
publishDate 2012
record_format invenio
spelling cern-1462270
2019-09-30T06:29:59Z
http://cds.cern.ch/record/1462270
eng
Golubkov, D V
Minaenko, A A
Vaniachine, A V
Reliability Engineering for ATLAS Petascale Data Processing on the Grid
Detectors and Experimental Techniques
The ATLAS detector is in its third year of continuous LHC running, taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated software and calibrations to improve the quality of the reconstructed data for physics analysis. Data reprocessing involves a significant commitment of computing resources and is conducted on the Grid. The reconstruction of one petabyte of ATLAS data with one billion collision events from the LHC takes about three million core-hours. Petascale data processing on the Grid involves millions of data processing jobs. At such scales, the reprocessing must handle a continuous stream of failures. Automatic job resubmission recovers from transient failures at the cost of the CPU time used by the failed jobs. By orchestrating ATLAS data processing applications to ensure efficient usage of tens of thousands of CPU cores, reliability engineering minimizes the reprocessing duration and failure recovery costs. In the 2010 reprocessing, the cost of recovering from transient failures was 6% of the CPU time used for reconstruction. In the 2011 reprocessing, this cost was reduced to 4% of the CPU time used for reconstruction. We present a reliability engineering analysis of the ATLAS petascale data processing on the Grid.
ATL-SOFT-SLIDE-2012-447
oai:cds.cern.ch:1462270
2012-07-17
spellingShingle Detectors and Experimental Techniques
Golubkov, D V
Minaenko, A A
Vaniachine, A V
Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_full Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_fullStr Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_full_unstemmed Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_short Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_sort reliability engineering for atlas petascale data processing on the grid
topic Detectors and Experimental Techniques
url http://cds.cern.ch/record/1462270
work_keys_str_mv AT golubkovdv reliabilityengineeringforatlaspetascaledataprocessingonthegrid
AT minaenkoaa reliabilityengineeringforatlaspetascaledataprocessingonthegrid
AT vaniachineav reliabilityengineeringforatlaspetascaledataprocessingonthegrid
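
The figures quoted in the description can be turned into a rough back-of-the-envelope calculation. The sketch below is illustrative only: it applies the 2010 and 2011 recovery percentages to the per-petabyte reconstruction cost, which is an assumption, since the record does not state the total data volume of either reprocessing campaign. All function and constant names are hypothetical.

```python
# Figures quoted in the record's description.
CORE_HOURS_PER_PETABYTE = 3_000_000    # "about three million core-hours"
EVENTS_PER_PETABYTE = 1_000_000_000    # one billion collision events

# Cost of recovering from transient failures, as a percentage of the
# CPU time used for reconstruction, per reprocessing campaign.
RECOVERY_PERCENT = {2010: 6, 2011: 4}


def core_seconds_per_event() -> float:
    """Average reconstruction cost of one event, in core-seconds."""
    return CORE_HOURS_PER_PETABYTE * 3600 / EVENTS_PER_PETABYTE


def recovery_core_hours(year: int,
                        reconstruction_core_hours: float = CORE_HOURS_PER_PETABYTE) -> float:
    """Core-hours spent on failed jobs for a given reconstruction budget.

    Assumption: the campaign-wide percentage is applied to a
    per-petabyte budget purely for illustration.
    """
    return RECOVERY_PERCENT[year] * reconstruction_core_hours / 100


if __name__ == "__main__":
    print(f"core-seconds per event: {core_seconds_per_event():.1f}")
    for year in sorted(RECOVERY_PERCENT):
        print(f"{year}: {recovery_core_hours(year):,.0f} core-hours per petabyte")
```

Under these assumptions, the reduction from 6% to 4% corresponds to a saving of roughly 60,000 core-hours for every petabyte reconstructed.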