“Big Data In HEP” - Physics Data Analysis, Machine learning and Data Reduction at Scale with Apache Spark

Bibliographic Details
Main author: Canali, Luca
Language: eng
Published: 2019
Subjects:
Online access: http://cds.cern.ch/record/2692203
Description
Summary: The field of High Energy Physics is approaching an era where the excellent performance of particle accelerators delivers enormous numbers of collisions. The growing size of these datasets could potentially become a limiting factor in the capability to produce scientific results. “Big Data” technologies developed and optimized in industry could help analyze Petabyte- and Exabyte-scale datasets and enable the next big discoveries. In this talk, we present the CERN openlab/Intel project to enable LHC-style analysis on Apache Spark at scale: “The CMS Big Data Reduction Facility”. The goal was to develop the technical capabilities to provide the physics analyst with a data reduction facility. Working together with CERN openlab and Intel, CMS replicated a real physics analysis using Spark-based technologies, with the ambition of reducing 1 Petabyte of CMS data to 1 Terabyte directly suitable for final analysis in 5 hours. We will present scaling results and facility improvements achieved by using Intel’s CoFluent optimization tool. We will also discuss how the tools and methods developed in the CERN openlab project, in collaboration with Intel, have made it possible to develop an end-to-end data pipeline for deep learning research of interest in High Energy Physics (HEP), in particular applied to improving the accuracy of online event filtering. Apache Spark has been used for the data-lifting part of the pipeline, while Spark with Analytics Zoo and BigDL has been used to run parallel training of the neural networks on CPU clusters.
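
To illustrate the kind of Spark-based reduction the abstract describes, here is a minimal PySpark sketch; the input path, column names, and selection cut are hypothetical placeholders, not the actual CMS analysis.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cms-data-reduction").getOrCreate()

    # Read the large columnar input dataset (hypothetical path and schema).
    events = spark.read.parquet("hdfs:///cms/aod/")

    # Apply an event selection and keep only the columns the final
    # analysis needs, so the output shrinks toward analysis scale.
    reduced = (events
               .where(F.col("nMuon") >= 2)  # hypothetical selection cut
               .select("run", "luminosityBlock", "event",
                       "Muon_pt", "Muon_eta", "Muon_phi"))

    reduced.write.mode("overwrite").parquet("hdfs:///cms/reduced/")

Column pruning and predicate pushdown on columnar formats are what make a job like this scale: Spark reads only the selected columns and skips data excluded by the filter.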
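
Likewise, a minimal sketch of distributed CPU training with Analytics Zoo’s Keras-style API on top of BigDL, assuming the features produced by the Spark data-lifting stage can be fed as NumPy arrays; the network shape, feature count, and application name are invented for illustration and are not the talk’s actual classifier.

    import numpy as np
    from zoo.common.nncontext import init_nncontext
    from zoo.pipeline.api.keras.models import Sequential
    from zoo.pipeline.api.keras.layers import Dense

    # Initialize a SparkContext configured for Analytics Zoo / BigDL.
    sc = init_nncontext("hep-event-filter-training")

    # Toy stand-in for the lifted features (hypothetical: 14 features
    # per event, binary signal/background label).
    x = np.random.rand(1024, 14).astype("float32")
    y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

    model = Sequential()
    model.add(Dense(64, activation="relu", input_shape=(14,)))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # distributed=True runs data-parallel training across the Spark
    # executors' CPU cores, along the lines of the CPU-cluster
    # training mentioned in the abstract.
    model.fit(x, y, batch_size=128, nb_epoch=2, distributed=True)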