Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication


Bibliographic Details
Main Authors: Boychenko, Serhiy; Zerlauth, Markus; Garnier, Jean-Christophe; Zenha-Rela, Mario
Language: eng
Published: 2018
Subjects: Computing and Computers
Online Access: https://dx.doi.org/10.1145/3154273.3154320
http://cds.cern.ch/record/2800932
_version_ 1780972667602468864
author Boychenko, Serhiy
Zerlauth, Markus
Garnier, Jean-Christophe
Zenha-Rela, Mario
author_facet Boychenko, Serhiy
Zerlauth, Markus
Garnier, Jean-Christophe
Zenha-Rela, Mario
author_sort Boychenko, Serhiy
collection CERN
description Throughout the last decades, distributed file systems and processing engines have been the primary choice for applications requiring access to large amounts of data. Since the introduction of the MapReduce paradigm, relational databases have increasingly been replaced by more efficient and scalable architectures, in particular in environments where a query is expected to process TBytes or even PBytes of data in a single execution. That is the situation at CERN, where the data storage systems that are critical for the safe operation, exploitation and optimization of the particle accelerator complex are based on traditional database or file system solutions, which are already working well beyond their initially provisioned capacity. Despite the efficiency of modern distributed data storage and processing engines in handling large amounts of data, they are not optimized for heterogeneous workloads such as those arising in the dynamic environment of one of the world's largest scientific facilities. This contribution presents a Mixed Partitioning Scheme Replication (MPSR) solution that outperforms the conventional distributed processing environment configurations at CERN for virtually the entire parameter space of the accelerator monitoring systems' workload variations. Our main strategy is to replicate the data using a different partitioning scheme for each replica, with the individual partitioning criteria dynamically derived from the observed workload. To assess the efficiency of this approach in a wide range of scenarios, a behavioral simulator has been developed to compare and analyze the performance of MPSR against the current solution. Furthermore, we present the first results of a Hadoop-based prototype running on a relatively small cluster, which not only validate the simulation predictions but also confirm the higher efficiency of the proposed technique.
id cern-2800932
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2018
record_format invenio
spelling cern-2800932 2022-02-07T12:54:02Z doi:10.1145/3154273.3154320 http://cds.cern.ch/record/2800932 eng Boychenko, Serhiy; Zerlauth, Markus; Garnier, Jean-Christophe; Zenha-Rela, Mario. Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication. Computing and Computers. oai:cds.cern.ch:2800932 2018
spellingShingle Computing and Computers
Boychenko, Serhiy
Zerlauth, Markus
Garnier, Jean-Christophe
Zenha-Rela, Mario
Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_full Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_fullStr Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_full_unstemmed Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_short Optimizing distributed file storage and processing engines for CERN's Large Hadron Collider using multi criteria partitioned replication
title_sort optimizing distributed file storage and processing engines for cern's large hadron collider using multi criteria partitioned replication
topic Computing and Computers
url https://dx.doi.org/10.1145/3154273.3154320
http://cds.cern.ch/record/2800932
work_keys_str_mv AT boychenkoserhiy optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication
AT zerlauthmarkus optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication
AT garnierjeanchristophe optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication
AT zenharelamario optimizingdistributedfilestorageandprocessingenginesforcernslargehadroncolliderusingmulticriteriapartitionedreplication