“Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark

The field of High Energy Physics is approaching an era where the excellent performance of particle accelerators delivers enormous numbers of collisions. The growing size of these datasets could become a limiting factor in the capability to produce scientific results. “Big Data” technologies developed and optimized in industry could help analyze petabyte- and exabyte-scale datasets and enable the next big discoveries. In this talk, we present the CERN openlab/Intel project to enable LHC-style analysis on Apache Spark at scale: “The CMS Big Data Reduction Facility”. The goal was to develop the technical capabilities to provide the physics analyst with a data reduction facility. Working together with CERN openlab and Intel, CMS replicated a real physics analysis using Spark-based technologies, with the ambition of reducing, in 5 hours, 1 petabyte of CMS data to 1 terabyte directly suitable for final analysis. We will present scaling results and facility improvements achieved by using Intel’s CoFluent optimization tool. We will also discuss how the tools and methods developed in the CERN openlab project, in collaboration with Intel, have made it possible to build an end-to-end data pipeline for deep learning research of interest in High Energy Physics (HEP), in particular applied to improving the accuracy of online event filtering. Apache Spark has been used for the data lifting part of the pipeline, while Spark with Analytics Zoo and BigDL has been used to run parallel training of the neural networks on CPU clusters.
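The reduction step described in the abstract maps naturally onto Spark’s DataFrame API: read the collision events, apply the analysis selection, project out the needed columns, and write the smaller dataset back out. Below is a minimal sketch of that pattern in PySpark; the input path, column names, and cuts are illustrative assumptions, not details taken from the CMS project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cms-data-reduction").getOrCreate()

# Hypothetical input: collision events already converted from ROOT to Parquet.
events = spark.read.parquet("hdfs:///cms/events/*.parquet")

reduced = (
    events
    .filter(F.col("nMuon") >= 2)            # example selection: dimuon events
    .filter(F.col("Muon_pt")[0] > 25.0)     # leading-muon pT cut (GeV)
    .select("run", "luminosityBlock", "event",
            "Muon_pt", "Muon_eta", "Muon_phi")  # keep only analysis columns
)

# Write the reduced dataset, ready for final analysis.
reduced.write.mode("overwrite").parquet("hdfs:///cms/reduced/dimuon/")
```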

Bibliographic Details
Main author: Canali, Luca
Language: eng
Published: 2019
Subjects: other events or meetings
Online access: http://cds.cern.ch/record/2692203
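The abstract also mentions parallel training of neural networks with Spark, Analytics Zoo, and BigDL on CPU clusters. The sketch below shows what such a setup can look like with Analytics Zoo’s Keras-style API; the network shape, feature count, and toy in-memory data are assumptions standing in for the features produced by the Spark data-lifting stage.

```python
import numpy as np
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

# SparkContext configured for BigDL; training runs on the Spark executors.
sc = init_nncontext("hep-event-filter")

# Toy data standing in for Spark-produced features (14 high-level features).
x = np.random.rand(1024, 14).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(1, activation="sigmoid"))  # binary signal/background score

model.compile(optimizer="adam", loss="binary_crossentropy")
# In distributed mode the batch size should be a multiple of the total
# number of executor cores.
model.fit(x, y, batch_size=128, nb_epoch=5)
```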
_version_ 1780963931882258432
author Canali, Luca
author_facet Canali, Luca
author_sort Canali, Luca
collection CERN
description The field of High Energy Physics is approaching an era where the excellent performance of particle accelerators delivers enormous numbers of collisions. The growing size of these datasets could become a limiting factor in the capability to produce scientific results. “Big Data” technologies developed and optimized in industry could help analyze petabyte- and exabyte-scale datasets and enable the next big discoveries. In this talk, we present the CERN openlab/Intel project to enable LHC-style analysis on Apache Spark at scale: “The CMS Big Data Reduction Facility”. The goal was to develop the technical capabilities to provide the physics analyst with a data reduction facility. Working together with CERN openlab and Intel, CMS replicated a real physics analysis using Spark-based technologies, with the ambition of reducing, in 5 hours, 1 petabyte of CMS data to 1 terabyte directly suitable for final analysis. We will present scaling results and facility improvements achieved by using Intel’s CoFluent optimization tool. We will also discuss how the tools and methods developed in the CERN openlab project, in collaboration with Intel, have made it possible to build an end-to-end data pipeline for deep learning research of interest in High Energy Physics (HEP), in particular applied to improving the accuracy of online event filtering. Apache Spark has been used for the data lifting part of the pipeline, while Spark with Analytics Zoo and BigDL has been used to run parallel training of the neural networks on CPU clusters.
id cern-2692203
institution European Organization for Nuclear Research
language eng
publishDate 2019
record_format invenio
spelling cern-2692203 2022-11-02T22:24:39Z http://cds.cern.ch/record/2692203 eng Canali, Luca “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark IXPUG 2019 Annual Conference at CERN other events or meetings The field of High Energy Physics is approaching an era where the excellent performance of particle accelerators delivers enormous numbers of collisions. The growing size of these datasets could become a limiting factor in the capability to produce scientific results. “Big Data” technologies developed and optimized in industry could help analyze petabyte- and exabyte-scale datasets and enable the next big discoveries. In this talk, we present the CERN openlab/Intel project to enable LHC-style analysis on Apache Spark at scale: “The CMS Big Data Reduction Facility”. The goal was to develop the technical capabilities to provide the physics analyst with a data reduction facility. Working together with CERN openlab and Intel, CMS replicated a real physics analysis using Spark-based technologies, with the ambition of reducing, in 5 hours, 1 petabyte of CMS data to 1 terabyte directly suitable for final analysis. We will present scaling results and facility improvements achieved by using Intel’s CoFluent optimization tool. We will also discuss how the tools and methods developed in the CERN openlab project, in collaboration with Intel, have made it possible to build an end-to-end data pipeline for deep learning research of interest in High Energy Physics (HEP), in particular applied to improving the accuracy of online event filtering. Apache Spark has been used for the data lifting part of the pipeline, while Spark with Analytics Zoo and BigDL has been used to run parallel training of the neural networks on CPU clusters. oai:cds.cern.ch:2692203 2019
spellingShingle other events or meetings
Canali, Luca
“Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_full “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_fullStr “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_full_unstemmed “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_short “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_sort “big data in hep” - physics data analysis, machine learning and data reduction at scale with apache spark
topic other events or meetings
url http://cds.cern.ch/record/2692203
work_keys_str_mv AT canaliluca bigdatainhepphysicsdataanalysismachinelearninganddatareductionatscalewithapachespark
AT canaliluca ixpug2019annualconferenceatcern