
Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio

This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services, and connects globally distributed data centres under different technological and administrative control, at an unpr...


Bibliographic Details
Main authors: Lassnig, Mario; Vigne, Ralph; Beermann, Thomas Alfons; Barisits, Martin-Stefan; Garonne, Vincent; Serfon, Cedric
Language: eng
Published: 2015
Subjects: Particle Physics - Experiment
Online access: https://dx.doi.org/10.1088/1742-6596/664/6/062027
http://cds.cern.ch/record/2015469
_version_ 1780946687294963712
author Lassnig, Mario
Vigne, Ralph
Beermann, Thomas Alfons
Barisits, Martin-Stefan
Garonne, Vincent
Serfon, Cedric
author_facet Lassnig, Mario
Vigne, Ralph
Beermann, Thomas Alfons
Barisits, Martin-Stefan
Garonne, Vincent
Serfon, Cedric
author_sort Lassnig, Mario
collection CERN
description This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services and connects globally distributed data centres under different technological and administrative control, at an unprecedented data volume. It is therefore not possible to create a duplicate instance of Rucio for testing or integration. Every software upgrade or configuration change is thus potentially disruptive and requires fail-safe software and automatic error recovery. Rucio uses a three-layer scaling and mitigation strategy based on quasi-realtime monitoring. This strategy mainly employs independent stateless services, automatic failover, and service migration. The technologies used for deployment and mitigation include OpenStack, Puppet, Graphite, HAProxy and Apache. In this contribution, the interplay between these components, their deployment, software mitigation, and the monitoring strategy are discussed.
id cern-2015469
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2015
record_format invenio
spelling cern-2015469 | 2022-08-10T12:55:16Z | doi:10.1088/1742-6596/664/6/062027 | http://cds.cern.ch/record/2015469 | eng | Lassnig, Mario; Vigne, Ralph; Beermann, Thomas Alfons; Barisits, Martin-Stefan; Garonne, Vincent; Serfon, Cedric | Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio | Particle Physics - Experiment | This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services and connects globally distributed data centres under different technological and administrative control, at an unprecedented data volume. It is therefore not possible to create a duplicate instance of Rucio for testing or integration. Every software upgrade or configuration change is thus potentially disruptive and requires fail-safe software and automatic error recovery. Rucio uses a three-layer scaling and mitigation strategy based on quasi-realtime monitoring. This strategy mainly employs independent stateless services, automatic failover, and service migration. The technologies used for deployment and mitigation include OpenStack, Puppet, Graphite, HAProxy and Apache. In this contribution, the interplay between these components, their deployment, software mitigation, and the monitoring strategy are discussed. | ATL-SOFT-PROC-2015-023 | oai:cds.cern.ch:2015469 | 2015-05-12
spellingShingle Particle Physics - Experiment
Lassnig, Mario
Vigne, Ralph
Beermann, Thomas Alfons
Barisits, Martin-Stefan
Garonne, Vincent
Serfon, Cedric
Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
title Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
title_full Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
title_fullStr Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
title_full_unstemmed Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
title_short Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
title_sort scalable and fail-safe deployment of the atlas distributed data management system rucio
topic Particle Physics - Experiment
url https://dx.doi.org/10.1088/1742-6596/664/6/062027
http://cds.cern.ch/record/2015469
work_keys_str_mv AT lassnigmario scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio
AT vigneralph scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio
AT beermannthomasalfons scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio
AT barisitsmartinstefan scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio
AT garonnevincent scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio
AT serfoncedric scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio