Performance Improvements of EventIndex Distributed System at CERN
The work presented in this thesis is framed in the context of the EventIndex project of the ATLAS experiment, a big particle detector of the LHC (Large Hadron Collider) at CERN. The objective of the project is to catalog all the particle collisions, or events, recorded at the ATLAS detector and also...
Main author: | |
---|---|
Language: | eng |
Published: | 2023 |
Subjects: | |
Online access: | http://cds.cern.ch/record/2852032 |
Summary: | The work presented in this thesis is framed in the context of the EventIndex project of the ATLAS experiment, a large particle detector at the LHC (Large Hadron Collider) at CERN. The objective of the project is to catalog all the particle collisions, or events, recorded at the ATLAS detector, as well as those simulated over the duration of the experiment. With this catalog, data can be characterized at event granularity, which is important for end users who need to search for and locate events. Other automatic checks can be performed in the data reprocessing chain in order to ensure its correctness and optimize future processing. Due to the rise in production rates and total data volume expected for Run 3 (2022-2025) and the HL-LHC (end of the 2020s), a scalable system is required, one that also simplifies previous implementations. In this thesis we present contributions to the project in the areas of distributed data collection, storage of massive volumes of data, and access to them. A small quantity of information (metadata) per event is indexed at CERN (Tier-0) and at the computing centers of the ATLAS experiment distributed worldwide on the grid (10 Tier-1 and around 70 Tier-2). We present a new pull model for data collection on the grid, with an object store as temporary storage from which the data can be dynamically retrieved and ingested into the final backend. We also present contributions to a new big data store based on HBase/Phoenix, which is able to sustain the required data rates and total data volume and which simplifies the previous hybrid solutions while addressing their limitations. Finally, we present a computing framework and tools using Spark for data access and for solving analytic use cases whose workloads access large amounts of data, such as overlap calculation or duplicate event detection. |