Reliability Engineering for ATLAS Petascale Data Processing on the Grid
The ATLAS detector is in its third year of continuous LHC running taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated...
Main authors: | Golubkov, D V; Minaenko, A A; Vaniachine, A V |
---|---|
Language: | eng |
Published: | 2012 |
Subjects: | Detectors and Experimental Techniques |
Online access: | http://cds.cern.ch/record/1462270 |
_version_ | 1780925302972612608 |
---|---|
author | Golubkov, D V Minaenko, A A Vaniachine, A V |
author_facet | Golubkov, D V Minaenko, A A Vaniachine, A V |
author_sort | Golubkov, D V |
collection | CERN |
description | The ATLAS detector is in its third year of continuous LHC running taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated software and calibrations to improve the quality of the reconstructed data for physics analysis. Data reprocessing involves a significant commitment of computing resources and is conducted on the Grid. The reconstruction of one petabyte of ATLAS data with 1B collision events from the LHC takes about three million core-hours. Petascale data processing on the Grid involves millions of data processing jobs. At such scales, the reprocessing must handle a continuous stream of failures. Automatic job resubmission recovers transient failures at the cost of CPU time used by the failed jobs. Orchestrating ATLAS data processing applications to ensure efficient usage of tens of thousands of CPU-cores, reliability engineering minimizes the reprocessing duration and failure recovery costs. In 2010 reprocessing, the cost to recover transient failures was 6% of the CPU time used for reconstruction. In 2011 reprocessing, the cost used to recover transient failures was reduced to 4% of the CPU time used for the reconstruction. We present reliability engineering analysis of the ATLAS petascale data processing on the Grid. |
id | cern-1462270 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2012 |
record_format | invenio |
spelling | cern-1462270 2019-09-30T06:29:59Z http://cds.cern.ch/record/1462270 eng Golubkov, D V Minaenko, A A Vaniachine, A V Reliability Engineering for ATLAS Petascale Data Processing on the Grid Detectors and Experimental Techniques The ATLAS detector is in its third year of continuous LHC running taking data for physics analysis. A starting point for ATLAS physics analysis is reconstruction of the raw data. First-pass processing takes place shortly after data taking, followed later by reprocessing of the raw data with updated software and calibrations to improve the quality of the reconstructed data for physics analysis. Data reprocessing involves a significant commitment of computing resources and is conducted on the Grid. The reconstruction of one petabyte of ATLAS data with 1B collision events from the LHC takes about three million core-hours. Petascale data processing on the Grid involves millions of data processing jobs. At such scales, the reprocessing must handle a continuous stream of failures. Automatic job resubmission recovers transient failures at the cost of CPU time used by the failed jobs. Orchestrating ATLAS data processing applications to ensure efficient usage of tens of thousands of CPU-cores, reliability engineering minimizes the reprocessing duration and failure recovery costs. In 2010 reprocessing, the cost to recover transient failures was 6% of the CPU time used for reconstruction. In 2011 reprocessing, the cost used to recover transient failures was reduced to 4% of the CPU time used for the reconstruction. We present reliability engineering analysis of the ATLAS petascale data processing on the Grid. ATL-SOFT-SLIDE-2012-447 oai:cds.cern.ch:1462270 2012-07-17 |
spellingShingle | Detectors and Experimental Techniques Golubkov, D V Minaenko, A A Vaniachine, A V Reliability Engineering for ATLAS Petascale Data Processing on the Grid |
title | Reliability Engineering for ATLAS Petascale Data Processing on the Grid |
title_full | Reliability Engineering for ATLAS Petascale Data Processing on the Grid |
title_fullStr | Reliability Engineering for ATLAS Petascale Data Processing on the Grid |
title_full_unstemmed | Reliability Engineering for ATLAS Petascale Data Processing on the Grid |
title_short | Reliability Engineering for ATLAS Petascale Data Processing on the Grid |
title_sort | reliability engineering for atlas petascale data processing on the grid |
topic | Detectors and Experimental Techniques |
url | http://cds.cern.ch/record/1462270 |
work_keys_str_mv | AT golubkovdv reliabilityengineeringforatlaspetascaledataprocessingonthegrid AT minaenkoaa reliabilityengineeringforatlaspetascaledataprocessingonthegrid AT vaniachineav reliabilityengineeringforatlaspetascaledataprocessingonthegrid |