Monitoring data transfer latency in CMS computing operations

During the first LHC run, the CMS experiment collected tens of Petabytes of collision and simulated data, which need to be distributed among dozens of computing centres with low latency in order to make efficient use of the resources. While the desired level of throughput has been successfully achie...

Bibliographic Details
Main authors:	Bonacorsi, D, Diotalevi, T, Magini, N, Sartirana, A, Taze, M, Wildish, T
Language:	eng
Published:	2015
Subjects:	Computing and Computers
Online access:	https://dx.doi.org/10.1088/1742-6596/664/3/032033 http://cds.cern.ch/record/2134550

_version_	1780949903197863936
author	Bonacorsi, D Diotalevi, T Magini, N Sartirana, A Taze, M Wildish, T
author_facet	Bonacorsi, D Diotalevi, T Magini, N Sartirana, A Taze, M Wildish, T
author_sort	Bonacorsi, D
collection	CERN
description	During the first LHC run, the CMS experiment collected tens of Petabytes of collision and simulated data, which need to be distributed among dozens of computing centres with low latency in order to make efficient use of the resources. While the desired level of throughput has been successfully achieved, it is still common to observe transfer workflows that cannot reach full completion in a timely manner due to a small fraction of stuck files which require operator intervention. For this reason, in 2012 the CMS transfer management system, PhEDEx, was instrumented with a monitoring system to measure file transfer latencies, and to predict the completion time for the transfer of a data set. The operators can detect abnormal patterns in transfer latencies while the transfer is still in progress, and monitor the long-term performance of the transfer infrastructure to plan the data placement strategy. Based on the data collected for one year with the latency monitoring system, we present a study on the different factors that contribute to transfer completion time. As case studies, we analyze several typical CMS transfer workflows, such as distribution of collision event data from CERN or upload of simulated event data from the Tier-2 centres to the archival Tier-1 centres. For each workflow, we present the typical patterns of transfer latencies that have been identified with the latency monitor. We identify the areas in PhEDEx where a development effort can reduce the latency, and we show how we are able to detect stuck transfers which need operator intervention. We propose a set of metrics to alert about stuck subscriptions and prompt for manual intervention, with the aim of improving transfer completion times.
id	oai-inspirehep.net-1413830
institution	European Organization for Nuclear Research (CERN)
language	eng
publishDate	2015
record_format	invenio
spelling	oai-inspirehep.net-1413830 2022-08-10T13:00:51Z doi:10.1088/1742-6596/664/3/032033 http://cds.cern.ch/record/2134550 eng Bonacorsi, D Diotalevi, T Magini, N Sartirana, A Taze, M Wildish, T Monitoring data transfer latency in CMS computing operations Computing and Computers During the first LHC run, the CMS experiment collected tens of Petabytes of collision and simulated data, which need to be distributed among dozens of computing centres with low latency in order to make efficient use of the resources. While the desired level of throughput has been successfully achieved, it is still common to observe transfer workflows that cannot reach full completion in a timely manner due to a small fraction of stuck files which require operator intervention. For this reason, in 2012 the CMS transfer management system, PhEDEx, was instrumented with a monitoring system to measure file transfer latencies, and to predict the completion time for the transfer of a data set. The operators can detect abnormal patterns in transfer latencies while the transfer is still in progress, and monitor the long-term performance of the transfer infrastructure to plan the data placement strategy. Based on the data collected for one year with the latency monitoring system, we present a study on the different factors that contribute to transfer completion time. As case studies, we analyze several typical CMS transfer workflows, such as distribution of collision event data from CERN or upload of simulated event data from the Tier-2 centres to the archival Tier-1 centres. For each workflow, we present the typical patterns of transfer latencies that have been identified with the latency monitor. We identify the areas in PhEDEx where a development effort can reduce the latency, and we show how we are able to detect stuck transfers which need operator intervention. We propose a set of metrics to alert about stuck subscriptions and prompt for manual intervention, with the aim of improving transfer completion times. oai:inspirehep.net:1413830 2015
spellingShingle	Computing and Computers Bonacorsi, D Diotalevi, T Magini, N Sartirana, A Taze, M Wildish, T Monitoring data transfer latency in CMS computing operations
title	Monitoring data transfer latency in CMS computing operations
title_full	Monitoring data transfer latency in CMS computing operations
title_fullStr	Monitoring data transfer latency in CMS computing operations
title_full_unstemmed	Monitoring data transfer latency in CMS computing operations
title_short	Monitoring data transfer latency in CMS computing operations
title_sort	monitoring data transfer latency in cms computing operations
topic	Computing and Computers
url	https://dx.doi.org/10.1088/1742-6596/664/3/032033 http://cds.cern.ch/record/2134550
work_keys_str_mv	AT bonacorsid monitoringdatatransferlatencyincmscomputingoperations AT diotalevit monitoringdatatransferlatencyincmscomputingoperations AT maginin monitoringdatatransferlatencyincmscomputingoperations AT sartiranaa monitoringdatatransferlatencyincmscomputingoperations AT tazem monitoringdatatransferlatencyincmscomputingoperations AT wildisht monitoringdatatransferlatencyincmscomputingoperations
