
Reliability Engineering for ATLAS Petascale Data Processing on the Grid

The ATLAS detector is in its third year of continuous LHC running, taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated...


Bibliographic Details
Main authors:	Golubkov, D V; Minaenko, A A; Vaniachine, A V
Language:	eng
Published:	2012
Subjects:	Detectors and Experimental Techniques
Online access:	http://cds.cern.ch/record/1462270

_version_	1780925302972612608
author	Golubkov, D V; Minaenko, A A; Vaniachine, A V
author_facet	Golubkov, D V; Minaenko, A A; Vaniachine, A V
author_sort	Golubkov, D V
collection	CERN
description	The ATLAS detector is in its third year of continuous LHC running, taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated software and calibrations to improve the quality of the reconstructed data for physics analysis. Data reprocessing involves a significant commitment of computing resources and is conducted on the Grid. The reconstruction of one petabyte of ATLAS data with one billion collision events from the LHC takes about three million core-hours. Petascale data processing on the Grid involves millions of data processing jobs. At such scales, the reprocessing must handle a continuous stream of failures. Automatic job resubmission recovers transient failures at the cost of CPU time used by the failed jobs. By orchestrating ATLAS data processing applications to ensure efficient usage of tens of thousands of CPU cores, reliability engineering minimizes the reprocessing duration and failure recovery costs. In the 2010 reprocessing, the cost to recover transient failures was 6% of the CPU time used for reconstruction. In the 2011 reprocessing, this cost was reduced to 4% of the CPU time used for reconstruction. We present a reliability engineering analysis of the ATLAS petascale data processing on the Grid.
id	cern-1462270
institution	European Organization for Nuclear Research (CERN)
language	eng
publishDate	2012
record_format	invenio
spelling	cern-1462270; 2019-09-30T06:29:59Z; http://cds.cern.ch/record/1462270; eng; Golubkov, D V; Minaenko, A A; Vaniachine, A V; Reliability Engineering for ATLAS Petascale Data Processing on the Grid; Detectors and Experimental Techniques; The ATLAS detector is in its third year of continuous LHC running, taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated software and calibrations to improve the quality of the reconstructed data for physics analysis. Data reprocessing involves a significant commitment of computing resources and is conducted on the Grid. The reconstruction of one petabyte of ATLAS data with one billion collision events from the LHC takes about three million core-hours. Petascale data processing on the Grid involves millions of data processing jobs. At such scales, the reprocessing must handle a continuous stream of failures. Automatic job resubmission recovers transient failures at the cost of CPU time used by the failed jobs. By orchestrating ATLAS data processing applications to ensure efficient usage of tens of thousands of CPU cores, reliability engineering minimizes the reprocessing duration and failure recovery costs. In the 2010 reprocessing, the cost to recover transient failures was 6% of the CPU time used for reconstruction. In the 2011 reprocessing, this cost was reduced to 4% of the CPU time used for reconstruction. We present a reliability engineering analysis of the ATLAS petascale data processing on the Grid.; ATL-SOFT-SLIDE-2012-447; oai:cds.cern.ch:1462270; 2012-07-17
spellingShingle	Detectors and Experimental Techniques Golubkov, D V Minaenko, A A Vaniachine, A V Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title	Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_full	Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_fullStr	Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_full_unstemmed	Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_short	Reliability Engineering for ATLAS Petascale Data Processing on the Grid
title_sort	reliability engineering for atlas petascale data processing on the grid
topic	Detectors and Experimental Techniques
url	http://cds.cern.ch/record/1462270
work_keys_str_mv	AT golubkovdv reliabilityengineeringforatlaspetascaledataprocessingonthegrid AT minaenkoaa reliabilityengineeringforatlaspetascaledataprocessingonthegrid AT vaniachineav reliabilityengineeringforatlaspetascaledataprocessingonthegrid
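
The figures quoted in the description can be turned into a rough per-event and per-petabyte estimate. The sketch below is only a back-of-the-envelope illustration, not the authors' analysis: it assumes the quoted three million core-hours and one billion events apply to a single petabyte, and that the 6% and 4% recovery costs can be scaled against that per-petabyte reconstruction time. The function and constant names are hypothetical.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the abstract.
# Assumption: recovery percentages are applied to the per-petabyte
# reconstruction time; actual campaign sizes are not given in the record.

CORE_HOURS_PER_PB = 3_000_000    # "about three million core-hours" per petabyte
EVENTS_PER_PB = 1_000_000_000    # one billion collision events per petabyte
RECOVERY_FRACTION = {2010: 0.06, 2011: 0.04}  # share of reconstruction CPU time


def core_seconds_per_event(core_hours=CORE_HOURS_PER_PB, events=EVENTS_PER_PB):
    """Average reconstruction cost of one event, in core-seconds."""
    return core_hours * 3600 / events


def recovery_core_hours(year, reconstruction_core_hours=CORE_HOURS_PER_PB):
    """CPU time spent on failed-then-resubmitted jobs, in core-hours."""
    return RECOVERY_FRACTION[year] * reconstruction_core_hours


if __name__ == "__main__":
    print(f"{core_seconds_per_event():.1f} core-seconds per event")
    for year in sorted(RECOVERY_FRACTION):
        hours = recovery_core_hours(year)
        print(f"{year}: about {hours:,.0f} core-hours of recovery cost per petabyte")
```

Under these assumptions, one event costs roughly 10.8 core-seconds to reconstruct, and the drop from 6% to 4% corresponds to about 60,000 fewer core-hours of recovery cost per petabyte reprocessed.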
