Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework

During the past year of operating the Large Hadron Collider (LHC), the amount of transient accelerator data to be persisted and analysed has been steadily growing. Since the startup of the LHC in 2006, weekly data storage requirements have exceeded what the systems were initially designed to...

Bibliographic Details
Main author:	Goncalves, Nuno
Language:	eng
Published:	2016
Subjects:	Information Transfer and Management; Computing and Computers
Online access:	http://cds.cern.ch/record/2210639

_version_	1780951822191558656
author	Goncalves, Nuno
author_facet	Goncalves, Nuno
author_sort	Goncalves, Nuno
collection	CERN
description	During the past year of operating the Large Hadron Collider (LHC), the amount of transient accelerator data to be persisted and analysed has been steadily growing. Since the startup of the LHC in 2006, weekly data storage requirements have exceeded what the systems were initially designed to accommodate in a full year of operation. Moreover, data acquisition rates are predicted to keep increasing, owing to infrastructure improvements foreseen within the scope of the High Luminosity LHC project. Despite efforts to improve and optimize the current data storage infrastructures (the CERN Accelerator Logging Service and the Post Mortem database), some limitations persist, and a different approach is required to scale up efficiently and serve future machine upgrades. This project explores one of the novel solutions proposed for working with large datasets. The configuration consists of Spark for data processing and the Hadoop Distributed File System (HDFS) with the Parquet format for data storage. This setup aims to enable fast data access without sacrificing the performance of analytical queries, which require large amounts of data to be processed. The workload configurations used in the benchmarking were adapted from previous studies performed by the TE-MPE-MS team.
id	cern-2210639
institution	European Organization for Nuclear Research
language	eng
publishDate	2016
record_format	invenio
spelling	cern-2210639 2019-09-30T06:29:59Z http://cds.cern.ch/record/2210639 eng Goncalves, Nuno Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework Information Transfer and Management Computing and Computers CERN-STUDENTS-Note-2016-146 oai:cds.cern.ch:2210639 2016-08-26
spellingShingle	Information Transfer and Management Computing and Computers Goncalves, Nuno Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
title	Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
title_full	Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
title_fullStr	Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
title_full_unstemmed	Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
title_short	Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
title_sort	benchmarking of modern data analysis tools for a 2nd generation transient data analysis framework
topic	Information Transfer and Management Computing and Computers
url	http://cds.cern.ch/record/2210639
work_keys_str_mv	AT goncalvesnuno benchmarkingofmoderndataanalysistoolsfora2ndgenerationtransientdataanalysisframework
