
Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework

Bibliographic Details
Main author: Goncalves, Nuno
Language: eng
Published: 2016
Subjects: Information Transfer and Management; Computing and Computers
Online access: http://cds.cern.ch/record/2210639
author Goncalves, Nuno
collection CERN
description During the past year of operating the Large Hadron Collider (LHC), the amount of transient accelerator data to be persisted and analysed has been steadily growing. Since the startup of the LHC in 2006, weekly data storage requirements have come to exceed what the systems were initially designed to accommodate in a full year of operation. Moreover, data acquisition rates are predicted to keep increasing, due to foreseen infrastructure improvements within the scope of the High Luminosity LHC project. Despite efforts to improve and optimize the current data storage infrastructures (the CERN Accelerator Logging Service and the Post Mortem database), some limitations persist, and a different approach is required to scale up efficiently and serve future machine upgrades. This project explores one of the novel solutions proposed for working with large datasets. The configuration combines Spark for data processing with the Hadoop Distributed File System (HDFS) and the Parquet format for data storage. This setup aims to enable fast data access without sacrificing the performance of analytical queries, which require large amounts of data to be processed. The workload configurations used in the benchmarking were adapted from previous studies performed by the TE-MPE-MS team.
id cern-2210639
institution European Organization for Nuclear Research
language eng
publishDate 2016
record_format invenio
spelling cern-2210639
2019-09-30T06:29:59Z
http://cds.cern.ch/record/2210639
eng
Goncalves, Nuno
Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
Information Transfer and Management
Computing and Computers
CERN-STUDENTS-Note-2016-146
oai:cds.cern.ch:2210639
2016-08-26
title Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
topic Information Transfer and Management
Computing and Computers
url http://cds.cern.ch/record/2210639