
Using Big Data Technologies for HEP Analysis

Bibliographic Details
Main Authors: Cremonesi, Matteo, Bellini, Claudio, Bian, Bianny, Canali, Luca, Dimakopoulos, Vasileios, Elmer, Peter, Fisk, Ian, Girone, Maria, Gutsche, Oliver, Hoh, Siew-Yan, Jayatilaka, Bo, Khristenko, Viktor, Luiselli, Andrea, Melo, Andrew, Motesnitsalis, Evangelos, Olivito, Dominick, Pazzini, Jacopo, Pivarski, Jim, Svyatkovskiy, Alexey, Zanetti, Marco
Language: eng
Published: 2019
Subjects:
Online Access: https://dx.doi.org/10.1051/epjconf/201921406030
http://cds.cern.ch/record/2654764
_version_ 1780961130804412416
author Cremonesi, Matteo
Bellini, Claudio
Bian, Bianny
Canali, Luca
Dimakopoulos, Vasileios
Elmer, Peter
Fisk, Ian
Girone, Maria
Gutsche, Oliver
Hoh, Siew-Yan
Jayatilaka, Bo
Khristenko, Viktor
Luiselli, Andrea
Melo, Andrew
Motesnitsalis, Evangelos
Olivito, Dominick
Pazzini, Jacopo
Pivarski, Jim
Svyatkovskiy, Alexey
Zanetti, Marco
author_facet Cremonesi, Matteo
Bellini, Claudio
Bian, Bianny
Canali, Luca
Dimakopoulos, Vasileios
Elmer, Peter
Fisk, Ian
Girone, Maria
Gutsche, Oliver
Hoh, Siew-Yan
Jayatilaka, Bo
Khristenko, Viktor
Luiselli, Andrea
Melo, Andrew
Motesnitsalis, Evangelos
Olivito, Dominick
Pazzini, Jacopo
Pivarski, Jim
Svyatkovskiy, Alexey
Zanetti, Marco
author_sort Cremonesi, Matteo
collection CERN
description The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results in a timely and efficient manner. Recently, new technologies and new approaches have been developed in industry to answer the need to retrieve information as quickly as possible when analyzing PB- and EB-scale datasets. Providing scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we present the latest developments and the most recent results on the use of Apache Spark for HEP analysis. The study aims to evaluate the efficiency of applying the new tools both quantitatively, by measuring performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data in a format suitable for physics analysis within 5 hours. The second goal is achieved by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-to-end analyses, up to the publication plots, on different hardware, feasibility, usability, and portability are compared with those of a traditional ROOT-based workflow.
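To make the data reduction step described above concrete, the following is a minimal PySpark sketch of that kind of workflow, not taken from the paper: the input path, the "org.dianahep.sparkroot" format string, the branch/column names (nMuon, Muon_pt, etc.), and the selection are assumptions made purely for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; a real reduction facility would run this on a
# cluster (e.g. YARN or Kubernetes) rather than on a single machine.
spark = SparkSession.builder.appName("hep-data-reduction").getOrCreate()

# Read the input events as a Spark DataFrame. The "org.dianahep.sparkroot"
# format string and the HDFS path are assumptions for this sketch; any
# source that exposes the events as a DataFrame would work the same way.
events = (spark.read
          .format("org.dianahep.sparkroot")
          .load("hdfs:///data/cms/opendata/*.root"))

# Apply a simple event selection and keep only the columns needed by the
# downstream analysis; this is the step that shrinks the dataset
# (conceptually, the PB-to-TB reduction).
reduced = (events
           .filter(F.col("nMuon") >= 2)
           .select("run", "luminosityBlock", "event",
                   "Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge"))

# Write the slimmed dataset in a columnar format suitable for analysis.
reduced.write.mode("overwrite").parquet("hdfs:///user/analysis/reduced_muons")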
id cern-2654764
institution European Organization for Nuclear Research
language eng
publishDate 2019
record_format invenio
spelling cern-2654764 2022-02-01T07:20:48Z
doi:10.1051/epjconf/201921406030
http://cds.cern.ch/record/2654764
eng
Cremonesi, Matteo; Bellini, Claudio; Bian, Bianny; Canali, Luca; Dimakopoulos, Vasileios; Elmer, Peter; Fisk, Ian; Girone, Maria; Gutsche, Oliver; Hoh, Siew-Yan; Jayatilaka, Bo; Khristenko, Viktor; Luiselli, Andrea; Melo, Andrew; Motesnitsalis, Evangelos; Olivito, Dominick; Pazzini, Jacopo; Pivarski, Jim; Svyatkovskiy, Alexey; Zanetti, Marco
Using Big Data Technologies for HEP Analysis
cs.DC
Computing and Computers
The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results in a timely and efficient manner. Recently, new technologies and new approaches have been developed in industry to answer the need to retrieve information as quickly as possible when analyzing PB- and EB-scale datasets. Providing scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we present the latest developments and the most recent results on the use of Apache Spark for HEP analysis. The study aims to evaluate the efficiency of applying the new tools both quantitatively, by measuring performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data in a format suitable for physics analysis within 5 hours. The second goal is achieved by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-to-end analyses, up to the publication plots, on different hardware, feasibility, usability, and portability are compared with those of a traditional ROOT-based workflow.
arXiv:1901.07143
FERMILAB-PUB-19-037-CD-PPD
oai:cds.cern.ch:2654764
2019
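As a companion illustration of the analysis use cases mentioned in the abstract, the sketch below aggregates a preprocessed dataset into a binned distribution with Spark; it is only an assumed example, since the input path, the DiMuon_mass column, and the 1 GeV binning are not details from the paper.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hep-analysis-usecase").getOrCreate()

# Load an analysis-ready dataset, e.g. the output of a reduction step.
df = spark.read.parquet("hdfs:///user/analysis/reduced_muons")

# Fill a histogram by binning a mass column and counting events per bin;
# the column name "DiMuon_mass" and the 1 GeV bin width are assumptions.
hist = (df
        .withColumn("mass_bin", F.floor(F.col("DiMuon_mass")))
        .groupBy("mass_bin")
        .count()
        .orderBy("mass_bin"))

# The binned result is small, so it can be collected to the driver and
# handed to a plotting library (e.g. matplotlib) for the publication plot.
for row in hist.collect():
    print(row["mass_bin"], row["count"])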
spellingShingle cs.DC
Computing and Computers
Cremonesi, Matteo
Bellini, Claudio
Bian, Bianny
Canali, Luca
Dimakopoulos, Vasileios
Elmer, Peter
Fisk, Ian
Girone, Maria
Gutsche, Oliver
Hoh, Siew-Yan
Jayatilaka, Bo
Khristenko, Viktor
Luiselli, Andrea
Melo, Andrew
Motesnitsalis, Evangelos
Olivito, Dominick
Pazzini, Jacopo
Pivarski, Jim
Svyatkovskiy, Alexey
Zanetti, Marco
Using Big Data Technologies for HEP Analysis
title Using Big Data Technologies for HEP Analysis
title_full Using Big Data Technologies for HEP Analysis
title_fullStr Using Big Data Technologies for HEP Analysis
title_full_unstemmed Using Big Data Technologies for HEP Analysis
title_short Using Big Data Technologies for HEP Analysis
title_sort using big data technologies for hep analysis
topic cs.DC
Computing and Computers
url https://dx.doi.org/10.1051/epjconf/201921406030
http://cds.cern.ch/record/2654764
work_keys_str_mv AT cremonesimatteo usingbigdatatechnologiesforhepanalysis
AT belliniclaudio usingbigdatatechnologiesforhepanalysis
AT bianbianny usingbigdatatechnologiesforhepanalysis
AT canaliluca usingbigdatatechnologiesforhepanalysis
AT dimakopoulosvasileios usingbigdatatechnologiesforhepanalysis
AT elmerpeter usingbigdatatechnologiesforhepanalysis
AT fiskian usingbigdatatechnologiesforhepanalysis
AT gironemaria usingbigdatatechnologiesforhepanalysis
AT gutscheoliver usingbigdatatechnologiesforhepanalysis
AT hohsiewyan usingbigdatatechnologiesforhepanalysis
AT jayatilakabo usingbigdatatechnologiesforhepanalysis
AT khristenkoviktor usingbigdatatechnologiesforhepanalysis
AT luiselliandrea usingbigdatatechnologiesforhepanalysis
AT meloandrew usingbigdatatechnologiesforhepanalysis
AT motesnitsalisevangelos usingbigdatatechnologiesforhepanalysis
AT olivitodominick usingbigdatatechnologiesforhepanalysis
AT pazzinijacopo usingbigdatatechnologiesforhepanalysis
AT pivarskijim usingbigdatatechnologiesforhepanalysis
AT svyatkovskiyalexey usingbigdatatechnologiesforhepanalysis
AT zanettimarco usingbigdatatechnologiesforhepanalysis