Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio
This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services, and connects globally distributed data centres under different technological and administrative control, at an unpr...
Main authors: | Lassnig, Mario; Vigne, Ralph; Beermann, Thomas Alfons; Barisits, Martin-Stefan; Garonne, Vincent; Serfon, Cedric |
---|---|
Language: | eng |
Published: | 2015 |
Subjects: | Particle Physics - Experiment |
Online access: | https://dx.doi.org/10.1088/1742-6596/664/6/062027 http://cds.cern.ch/record/2015469 |
_version_ | 1780946687294963712 |
---|---|
author | Lassnig, Mario Vigne, Ralph Beermann, Thomas Alfons Barisits, Martin-Stefan Garonne, Vincent Serfon, Cedric |
author_facet | Lassnig, Mario Vigne, Ralph Beermann, Thomas Alfons Barisits, Martin-Stefan Garonne, Vincent Serfon, Cedric |
author_sort | Lassnig, Mario |
collection | CERN |
description | This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services, and connects globally distributed data centres under different technological and administrative control, at an unprecedented data volume. It is therefore not possible to create a duplicate instance of Rucio for testing or integration. Every software upgrade or configuration change is thus potentially disruptive and requires fail-safe software and automatic error recovery. Rucio uses a three-layer scaling and mitigation strategy based on quasi-realtime monitoring. This strategy mainly employs independent stateless services, automatic failover, and service migration. The technologies used for deployment and mitigation include OpenStack, Puppet, Graphite, HAProxy and Apache. In this contribution, the interplay between these components, their deployment, software mitigation, and the monitoring strategy are discussed. |
id | cern-2015469 |
institution | European Organization for Nuclear Research |
language | eng |
publishDate | 2015 |
record_format | invenio |
spelling | cern-2015469; 2022-08-10T12:55:16Z; doi:10.1088/1742-6596/664/6/062027; http://cds.cern.ch/record/2015469; eng; Lassnig, Mario; Vigne, Ralph; Beermann, Thomas Alfons; Barisits, Martin-Stefan; Garonne, Vincent; Serfon, Cedric; Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio; Particle Physics - Experiment; This contribution details the deployment of Rucio, the ATLAS Distributed Data Management system. The main complication is that Rucio interacts with a wide variety of external services, and connects globally distributed data centres under different technological and administrative control, at an unprecedented data volume. It is therefore not possible to create a duplicate instance of Rucio for testing or integration. Every software upgrade or configuration change is thus potentially disruptive and requires fail-safe software and automatic error recovery. Rucio uses a three-layer scaling and mitigation strategy based on quasi-realtime monitoring. This strategy mainly employs independent stateless services, automatic failover, and service migration. The technologies used for deployment and mitigation include OpenStack, Puppet, Graphite, HAProxy and Apache. In this contribution, the interplay between these components, their deployment, software mitigation, and the monitoring strategy are discussed.; ATL-SOFT-PROC-2015-023; oai:cds.cern.ch:2015469; 2015-05-12 |
spellingShingle | Particle Physics - Experiment Lassnig, Mario Vigne, Ralph Beermann, Thomas Alfons Barisits, Martin-Stefan Garonne, Vincent Serfon, Cedric Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio |
title | Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio |
title_full | Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio |
title_fullStr | Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio |
title_full_unstemmed | Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio |
title_short | Scalable and fail-safe deployment of the ATLAS Distributed Data Management system Rucio |
title_sort | scalable and fail-safe deployment of the atlas distributed data management system rucio |
topic | Particle Physics - Experiment |
url | https://dx.doi.org/10.1088/1742-6596/664/6/062027 http://cds.cern.ch/record/2015469 |
work_keys_str_mv | AT lassnigmario scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio AT vigneralph scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio AT beermannthomasalfons scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio AT barisitsmartinstefan scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio AT garonnevincent scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio AT serfoncedric scalableandfailsafedeploymentoftheatlasdistributeddatamanagementsystemrucio |