Multi-criteria partitioning on distributed file systems for efficient accelerator data analysis and performance optimization

Bibliographic Details
Main authors: Boychenko, Serhiy; Galilee, Marc-Antoine; Garnier, Jean-Christophe; Zenha-Rela, Mario; Zerlauth, Markus
Language: English
Published: 2018
Subjects:
Online access: https://dx.doi.org/10.18429/JACoW-ICALEPCS2017-THPHA036
http://cds.cern.ch/record/2305508
Description
Summary: Since the introduction of the MapReduce paradigm, relational databases are increasingly being replaced by more efficient and scalable architectures, in particular in environments where a single query processes terabytes or even petabytes of data. The same tendency is observed at CERN, where data archiving systems for operational accelerator data are already working well beyond their initially provisioned capacity. Most modern data analysis frameworks are not optimized for the heterogeneous workloads that arise in the dynamic environment of one of the world's largest accelerator complexes. This contribution presents Mixed Partitioning Scheme Replication (MPSR) as a solution that will outperform conventional distributed processing environment configurations for almost the entire phase space of data analysis use cases and performance optimization challenges that arise during the commissioning and operational phases of an accelerator. We present the results of a statistical analysis and the benchmarking of the implemented prototype, which allow us to define the characteristics of the proposed approach and to confirm the expected performance gains.