Cargando…

Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication

Throughout the last decades, distributed file systems and processing engines have been the primary choice for applications requiring access to large amounts of data. Since the introduction of the MapReduce paradigm, relational databases are being increasingly replaced by more efficient and scalable...

Descripción completa

Detalles Bibliográficos
Autores principales: Boychenko, Serhiy, Zerlauth, Markus, Garnier, Jean-Christophe, Zenha-Rela, Mario
Lenguaje:eng
Publicado: 2018
Materias:
Acceso en línea:https://dx.doi.org/10.1145/3154273.3154320
http://cds.cern.ch/record/2800932
Descripción
Sumario:Throughout the last decades, distributed file systems and processing engines have been the primary choice for applications requiring access to large amounts of data. Since the introduction of the MapReduce paradigm, relational databases are being increasingly replaced by more efficient and scalable architectures, in particular in environments where a query is expected to process TBytes or even PBytes of data in a single execution. That is the situation at CERN, where data storage systems that are critical for the safe operation, exploitation and optimization of the particle accelerator complex, are based on traditional databases or file system solutions, which are already working well beyond their initially provisioned capacity. Despite the efficiency of modern distributed data storage and processing engines in handling large amounts of data, they are not optimized for heterogeneous workloads such as they arise in the dynamic environment of one of the world's largest scientific facilities. This contribution presents a Mixed Partitioning Scheme Replication (MPSR) solution that outperforms the conventional distributed processing environment configurations at CERN for virtually the entire parameter space of the accelerator monitoring systems' workload variations. Our main strategy was to replicate the data using different partitioning schemes for each replica, whereas the individual partitioning criteria is dynamically derived from the observed workload. To assess the efficiency of this approach in a wide range of scenarios, a behavioral simulator has been developed to compare and analyze the performance of the MPSR with the current solution. Furthermore we present the first actual results of the Hadoop-based prototype running on a relatively small cluster that not only validates the simulation predictions but also confirms the higher efficiency of the proposed technique.