Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns

During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of data being reprocessed every year. In reprocessing on the Grid, failures can occur for a variety of reasons, while Grid heterogeneity makes failures ha...

Bibliographic Details
Main authors: Vaniachine, A; Golubkov, D; Karpenko, D
Language: eng
Published: 2013
Subjects: Detectors and Experimental Techniques
Online access: https://dx.doi.org/10.1088/1742-6596/513/3/032101
http://cds.cern.ch/record/1607143
_version_ 1780931715496148992
author Vaniachine, A
Golubkov, D
Karpenko, D
author_facet Vaniachine, A
Golubkov, D
Karpenko, D
author_sort Vaniachine, A
collection CERN
description During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of data being reprocessed every year. In reprocessing on the Grid, failures can occur for a variety of reasons, while Grid heterogeneity makes failures hard to diagnose and repair quickly. As a result, Big Data processing on the Grid must tolerate a continuous stream of failures, errors and faults. While ATLAS fault-tolerance mechanisms improve the reliability of Big Data processing in the Grid, their benefits come at costs and result in delays making the performance prediction difficult. Reliability Engineering provides a framework for fundamental understanding of the Big Data processing on the Grid, which is not a desirable enhancement but a necessary requirement. In ATLAS, cost monitoring and performance prediction became critical for the success of the reprocessing campaigns conducted in preparation for the major physics conferences. In addition, our Reliability Engineering approach supported continuous improvements in data reprocessing throughput during LHC data taking. The throughput doubled in 2011 vs. 2010 reprocessing, then quadrupled in 2012 vs. 2011 reprocessing. We present the Reliability Engineering analysis of ATLAS data reprocessing campaigns providing the foundation needed to scale up the Big Data processing technologies beyond the petascale.
id cern-1607143
institution European Organization for Nuclear Research
language eng
publishDate 2013
record_format invenio
spelling cern-1607143
2019-09-30T06:29:59Z
doi:10.1088/1742-6596/513/3/032101
http://cds.cern.ch/record/1607143
eng
Vaniachine, A
Golubkov, D
Karpenko, D
Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns
Detectors and Experimental Techniques
During three years of LHC data taking, the ATLAS collaboration completed three petascale data reprocessing campaigns on the Grid, with up to 2 PB of data being reprocessed every year. In reprocessing on the Grid, failures can occur for a variety of reasons, while Grid heterogeneity makes failures hard to diagnose and repair quickly. As a result, Big Data processing on the Grid must tolerate a continuous stream of failures, errors and faults. While ATLAS fault-tolerance mechanisms improve the reliability of Big Data processing in the Grid, their benefits come at costs and result in delays making the performance prediction difficult. Reliability Engineering provides a framework for fundamental understanding of the Big Data processing on the Grid, which is not a desirable enhancement but a necessary requirement. In ATLAS, cost monitoring and performance prediction became critical for the success of the reprocessing campaigns conducted in preparation for the major physics conferences. In addition, our Reliability Engineering approach supported continuous improvements in data reprocessing throughput during LHC data taking. The throughput doubled in 2011 vs. 2010 reprocessing, then quadrupled in 2012 vs. 2011 reprocessing. We present the Reliability Engineering analysis of ATLAS data reprocessing campaigns providing the foundation needed to scale up the Big Data processing technologies beyond the petascale.
ATL-SOFT-PROC-2013-017
oai:cds.cern.ch:1607143
2013-10-09
spellingShingle Detectors and Experimental Techniques
Vaniachine, A
Golubkov, D
Karpenko, D
Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns
title Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns
title_full Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns
title_fullStr Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns
title_full_unstemmed Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns
title_short Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns
title_sort reliability engineering analysis of atlas data reprocessing campaigns
topic Detectors and Experimental Techniques
url https://dx.doi.org/10.1088/1742-6596/513/3/032101
http://cds.cern.ch/record/1607143
work_keys_str_mv AT vaniachinea reliabilityengineeringanalysisofatlasdatareprocessingcampaigns
AT golubkovd reliabilityengineeringanalysisofatlasdatareprocessingcampaigns
AT karpenkod reliabilityengineeringanalysisofatlasdatareprocessingcampaigns