Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication


Bibliographic Details
Main Authors: Boychenko, Serhiy; Zerlauth, Markus; Garnier, Jean-Christophe; Zenha-Rela, Mario
Language: eng
Published: 2018
Subjects: Computing and Computers
Online Access: https://dx.doi.org/10.1145/3154273.3154320
http://cds.cern.ch/record/2800932
_version_ 1780972667602468864
author Boychenko, Serhiy
Zerlauth, Markus
Garnier, Jean-Christophe
Zenha-Rela, Mario
author_facet Boychenko, Serhiy
Zerlauth, Markus
Garnier, Jean-Christophe
Zenha-Rela, Mario
author_sort Boychenko, Serhiy
collection CERN
description Throughout the last decades, distributed file systems and processing engines have been the primary choice for applications requiring access to large amounts of data. Since the introduction of the MapReduce paradigm, relational databases have increasingly been replaced by more efficient and scalable architectures, in particular in environments where a query is expected to process TBytes or even PBytes of data in a single execution. That is the situation at CERN, where the data storage systems that are critical for the safe operation, exploitation and optimization of the particle accelerator complex are based on traditional database or file system solutions, which are already working well beyond their initially provisioned capacity. Despite the efficiency of modern distributed data storage and processing engines in handling large amounts of data, they are not optimized for heterogeneous workloads such as those arising in the dynamic environment of one of the world's largest scientific facilities. This contribution presents a Mixed Partitioning Scheme Replication (MPSR) solution that outperforms the conventional distributed processing environment configurations at CERN for virtually the entire parameter space of the accelerator monitoring systems' workload variations. Our main strategy is to replicate the data using a different partitioning scheme for each replica, with the individual partitioning criteria dynamically derived from the observed workload. To assess the efficiency of this approach in a wide range of scenarios, a behavioral simulator has been developed to compare and analyze the performance of MPSR against the current solution. Furthermore, we present the first results of a Hadoop-based prototype running on a relatively small cluster, which not only validate the simulation predictions but also confirm the higher efficiency of the proposed technique.
id cern-2800932
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2018
record_format invenio
spelling cern-2800932 2022-02-07T12:54:02Z doi:10.1145/3154273.3154320 http://cds.cern.ch/record/2800932 eng Boychenko, Serhiy; Zerlauth, Markus; Garnier, Jean-Christophe; Zenha-Rela, Mario. Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication. Computing and Computers. oai:cds.cern.ch:2800932 2018
spellingShingle Computing and Computers
Boychenko, Serhiy
Zerlauth, Markus
Garnier, Jean-Christophe
Zenha-Rela, Mario
Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_full Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_fullStr Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_full_unstemmed Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_short Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_sort optimizing distributed file storage and processing engines for cern's large hadron collider using multi criteria partitioned replication
topic Computing and Computers
url https://dx.doi.org/10.1145/3154273.3154320
http://cds.cern.ch/record/2800932
work_keys_str_mv AT boychenkoserhiy optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication
AT zerlauthmarkus optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication
AT garnierjeanchristophe optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication
AT zenharelamario optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication