Benchmarking of Modern Data Analysis Tools for a 2nd generation Transient Data Analysis Framework
Main author:
Language: eng
Published: 2016
Subjects:
Online access: http://cds.cern.ch/record/2210639
Summary: During the past year of operating the Large Hadron Collider (LHC), the amount of transient accelerator data to be persisted and analysed has grown steadily. Since the startup of the LHC in 2006, the weekly data storage requirements have come to exceed what the system was initially designed to accommodate in a full year of operation. Moreover, data acquisition rates are predicted to increase further, due to foreseen infrastructure improvements within the scope of the High Luminosity LHC project. Despite efforts to improve and optimize the current data storage infrastructures (the CERN Accelerator Logging Service and the Post Mortem database), some limitations persist and call for a different approach in order to scale up and provide efficient services for future machine upgrades. This project explores one of the novel solutions proposed for working with large datasets: a configuration composed of Spark for data processing and the Hadoop Distributed File System (HDFS) with the Parquet format for data storage. This setup aims to enable fast data access without sacrificing the performance of analytical queries, which require large amounts of data to be processed. The workload configurations used in the benchmarking were adapted from previous studies performed by the TE-MPE-MS team.
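The record itself contains no code, but as a rough illustration of the setup the summary describes (Spark querying Parquet data on HDFS), the following minimal PySpark sketch shows what such an analytical workload could look like. The HDFS path, the `day` partition column, and the `signal_name`/`value` fields are hypothetical placeholders, not details from the report.

```python
# Minimal sketch of a Spark + HDFS/Parquet analytical workload, in the spirit
# of the setup the summary describes. All paths and column names below are
# hypothetical illustrations, not taken from the report.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("transient-data-benchmark")
    .getOrCreate()
)

# Hypothetical layout: one row per logged sample, partitioned by day so that
# time-bounded queries only touch the relevant Parquet files on HDFS.
df = spark.read.parquet("hdfs:///lhc/logging/signals")

# Example analytical query: per-signal statistics over a one-week window.
result = (
    df.filter((F.col("day") >= "2016-05-01") & (F.col("day") < "2016-05-08"))
      .groupBy("signal_name")
      .agg(
          F.count("*").alias("samples"),
          F.avg("value").alias("mean_value"),
          F.max("value").alias("max_value"),
      )
)

result.show()
spark.stop()
```

Because Parquet stores data column by column, an aggregation like this only reads the columns it references, which is the property such a setup relies on to keep analytical queries over large datasets fast.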