“Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark

The field of High Energy Physics is approaching an era where the excellent performance of particle accelerators delivers enormous numbers of collisions. The growing size of these datasets could become a limiting factor in the capability to produce scientific results. “Big Data” technologies developed and optimized in industry could help analyze petabyte- and exabyte-scale datasets and enable the next big discoveries. In this talk, we present the CERN openlab/Intel project to enable LHC-style analysis on Apache Spark at scale: “The CMS Big Data Reduction Facility”. The goal was to develop the technical capabilities to provide the physics analyst with a data reduction facility. Working together with CERN openlab and Intel, CMS replicated a real physics analysis using Spark-based technologies, with the ambition of reducing, in 5 hours, 1 petabyte of CMS data to 1 terabyte directly suitable for final analysis. We will present scaling results and facility improvements achieved by using Intel’s CoFluent optimization tool. We will also discuss how the tools and methods developed in the CERN openlab project, in collaboration with Intel, have made it possible to build an end-to-end data pipeline for deep learning research of interest in High Energy Physics (HEP), in particular applied to improving the accuracy of online event filtering. Apache Spark has been used for the data lifting part of the pipeline, while Spark with Analytics Zoo and BigDL has been used to run parallel training of the neural networks on CPU clusters.
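The reduction step described in the abstract maps naturally onto Spark’s DataFrame API: read the collision events, apply the analysis selection, project out the needed columns, and write the smaller dataset back out. Below is a minimal sketch of that pattern in PySpark; the input path, column names, and cuts are illustrative assumptions, not details taken from the CMS project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cms-data-reduction").getOrCreate()

# Hypothetical input: collision events already converted from ROOT to Parquet.
events = spark.read.parquet("hdfs:///cms/events/*.parquet")

reduced = (
    events
    .filter(F.col("nMuon") >= 2)            # example selection: dimuon events
    .filter(F.col("Muon_pt")[0] > 25.0)     # leading-muon pT cut (GeV)
    .select("run", "luminosityBlock", "event",
            "Muon_pt", "Muon_eta", "Muon_phi")  # keep only analysis columns
)

# Write the reduced dataset, ready for final analysis.
reduced.write.mode("overwrite").parquet("hdfs:///cms/reduced/dimuon/")
```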

Bibliographic Details
Main author: Canali, Luca
Language: eng
Published: 2019
Subjects: other events or meetings
Online access: http://cds.cern.ch/record/2692203
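The abstract also mentions parallel training of neural networks with Spark, Analytics Zoo, and BigDL on CPU clusters. The sketch below shows what such a setup can look like with Analytics Zoo’s Keras-style API; the network shape, feature count, and toy in-memory data are assumptions standing in for the features produced by the Spark data-lifting stage.

```python
import numpy as np
from zoo.common.nncontext import init_nncontext
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import Dense

# SparkContext configured for BigDL; training runs on the Spark executors.
sc = init_nncontext("hep-event-filter")

# Toy data standing in for Spark-produced features (14 high-level features).
x = np.random.rand(1024, 14).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

model = Sequential()
model.add(Dense(50, activation="relu", input_shape=(14,)))
model.add(Dense(1, activation="sigmoid"))  # binary signal/background score

model.compile(optimizer="adam", loss="binary_crossentropy")
# In distributed mode the batch size should be a multiple of the total
# number of executor cores.
model.fit(x, y, batch_size=128, nb_epoch=5)
```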
_version_ 1780963931882258432
author Canali, Luca
author_facet Canali, Luca
author_sort Canali, Luca
collection CERN
description The field of High Energy Physics is approaching an era where the excellent performance of particle accelerators delivers enormous numbers of collisions. The growing size of these datasets could become a limiting factor in the capability to produce scientific results. “Big Data” technologies developed and optimized in industry could help analyze petabyte- and exabyte-scale datasets and enable the next big discoveries. In this talk, we present the CERN openlab/Intel project to enable LHC-style analysis on Apache Spark at scale: “The CMS Big Data Reduction Facility”. The goal was to develop the technical capabilities to provide the physics analyst with a data reduction facility. Working together with CERN openlab and Intel, CMS replicated a real physics analysis using Spark-based technologies, with the ambition of reducing, in 5 hours, 1 petabyte of CMS data to 1 terabyte directly suitable for final analysis. We will present scaling results and facility improvements achieved by using Intel’s CoFluent optimization tool. We will also discuss how the tools and methods developed in the CERN openlab project, in collaboration with Intel, have made it possible to build an end-to-end data pipeline for deep learning research of interest in High Energy Physics (HEP), in particular applied to improving the accuracy of online event filtering. Apache Spark has been used for the data lifting part of the pipeline, while Spark with Analytics Zoo and BigDL has been used to run parallel training of the neural networks on CPU clusters.
id cern-2692203
institution European Organization for Nuclear Research
language eng
publishDate 2019
record_format invenio
spelling cern-2692203 2022-11-02T22:24:39Z http://cds.cern.ch/record/2692203 eng Canali, Luca “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark IXPUG 2019 Annual Conference at CERN other events or meetings The field of High Energy Physics is approaching an era where the excellent performance of particle accelerators delivers enormous numbers of collisions. The growing size of these datasets could become a limiting factor in the capability to produce scientific results. “Big Data” technologies developed and optimized in industry could help analyze petabyte- and exabyte-scale datasets and enable the next big discoveries. In this talk, we present the CERN openlab/Intel project to enable LHC-style analysis on Apache Spark at scale: “The CMS Big Data Reduction Facility”. The goal was to develop the technical capabilities to provide the physics analyst with a data reduction facility. Working together with CERN openlab and Intel, CMS replicated a real physics analysis using Spark-based technologies, with the ambition of reducing, in 5 hours, 1 petabyte of CMS data to 1 terabyte directly suitable for final analysis. We will present scaling results and facility improvements achieved by using Intel’s CoFluent optimization tool. We will also discuss how the tools and methods developed in the CERN openlab project, in collaboration with Intel, have made it possible to build an end-to-end data pipeline for deep learning research of interest in High Energy Physics (HEP), in particular applied to improving the accuracy of online event filtering. Apache Spark has been used for the data lifting part of the pipeline, while Spark with Analytics Zoo and BigDL has been used to run parallel training of the neural networks on CPU clusters. oai:cds.cern.ch:2692203 2019
spellingShingle other events or meetings
Canali, Luca
“Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_full “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_fullStr “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_full_unstemmed “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_short “Big Data In HEP” - Physics Data Analysis, Machine Learning and Data Reduction at Scale with Apache Spark
title_sort “big data in hep” - physics data analysis, machine learning and data reduction at scale with apache spark
topic other events or meetings
url http://cds.cern.ch/record/2692203
work_keys_str_mv AT canaliluca bigdatainhepphysicsdataanalysismachinelearninganddatareductionatscalewithapachespark
AT canaliluca ixpug2019annualconferenceatcern