Using Big Data Technologies for HEP Analysis
The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at a high rate will force the experiments to record a large amount of information. The growing size of the datasets could become a limiting factor in the capability...
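Main authors: | Cremonesi, Matteo; Bellini, Claudio; Bian, Bianny; Canali, Luca; Dimakopoulos, Vasileios; Elmer, Peter; Fisk, Ian; Girone, Maria; Gutsche, Oliver; Hoh, Siew-Yan; Jayatilaka, Bo; Khristenko, Viktor; Luiselli, Andrea; Melo, Andrew; Motesnitsalis, Evangelos; Olivito, Dominick; Pazzini, Jacopo; Pivarski, Jim; Svyatkovskiy, Alexey; Zanetti, Marco |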
---|---|
Language: | eng |
Published: | 2019 |
Subjects: | cs.DC Computing and Computers |
Online access: | https://dx.doi.org/10.1051/epjconf/201921406030 http://cds.cern.ch/record/2654764 |
_version_ | 1780961130804412416 |
---|---|
author | Cremonesi, Matteo Bellini, Claudio Bian, Bianny Canali, Luca Dimakopoulos, Vasileios Elmer, Peter Fisk, Ian Girone, Maria Gutsche, Oliver Hoh, Siew-Yan Jayatilaka, Bo Khristenko, Viktor Luiselli, Andrea Melo, Andrew Motesnitsalis, Evangelos Olivito, Dominick Pazzini, Jacopo Pivarski, Jim Svyatkovskiy, Alexey Zanetti, Marco |
author_facet | Cremonesi, Matteo Bellini, Claudio Bian, Bianny Canali, Luca Dimakopoulos, Vasileios Elmer, Peter Fisk, Ian Girone, Maria Gutsche, Oliver Hoh, Siew-Yan Jayatilaka, Bo Khristenko, Viktor Luiselli, Andrea Melo, Andrew Motesnitsalis, Evangelos Olivito, Dominick Pazzini, Jacopo Pivarski, Jim Svyatkovskiy, Alexey Zanetti, Marco |
author_sort | Cremonesi, Matteo |
collection | CERN |
description | The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at a high rate will force the experiments to record a large amount of information. The growing size of the datasets could become a limiting factor in the capability to produce scientific results in a timely and efficient manner. Recently, new technologies and new approaches have been developed in industry to meet the need to retrieve information as quickly as possible when analyzing PB and EB datasets. Providing scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we present the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the new tools both quantitatively, by measuring performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data, in a format suitable for physics analysis, in 5 hours. The second goal is achieved by implementing multiple physics use-cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability and portability are compared to those of a traditional ROOT-based workflow. |
id | cern-2654764 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2019 |
record_format | invenio |
spelling | cern-2654764 2022-02-01T07:20:48Z doi:10.1051/epjconf/201921406030 http://cds.cern.ch/record/2654764 eng Cremonesi, Matteo Bellini, Claudio Bian, Bianny Canali, Luca Dimakopoulos, Vasileios Elmer, Peter Fisk, Ian Girone, Maria Gutsche, Oliver Hoh, Siew-Yan Jayatilaka, Bo Khristenko, Viktor Luiselli, Andrea Melo, Andrew Motesnitsalis, Evangelos Olivito, Dominick Pazzini, Jacopo Pivarski, Jim Svyatkovskiy, Alexey Zanetti, Marco Using Big Data Technologies for HEP Analysis cs.DC Computing and Computers arXiv:1901.07143 FERMILAB-PUB-19-037-CD-PPD oai:cds.cern.ch:2654764 2019 |
spellingShingle | cs.DC Computing and Computers Cremonesi, Matteo Bellini, Claudio Bian, Bianny Canali, Luca Dimakopoulos, Vasileios Elmer, Peter Fisk, Ian Girone, Maria Gutsche, Oliver Hoh, Siew-Yan Jayatilaka, Bo Khristenko, Viktor Luiselli, Andrea Melo, Andrew Motesnitsalis, Evangelos Olivito, Dominick Pazzini, Jacopo Pivarski, Jim Svyatkovskiy, Alexey Zanetti, Marco Using Big Data Technologies for HEP Analysis |
title | Using Big Data Technologies for HEP Analysis |
title_full | Using Big Data Technologies for HEP Analysis |
title_fullStr | Using Big Data Technologies for HEP Analysis |
title_full_unstemmed | Using Big Data Technologies for HEP Analysis |
title_short | Using Big Data Technologies for HEP Analysis |
title_sort | using big data technologies for hep analysis |
topic | cs.DC Computing and Computers |
url | https://dx.doi.org/10.1051/epjconf/201921406030 http://cds.cern.ch/record/2654764 |
work_keys_str_mv | AT cremonesimatteo usingbigdatatechnologiesforhepanalysis AT belliniclaudio usingbigdatatechnologiesforhepanalysis AT bianbianny usingbigdatatechnologiesforhepanalysis AT canaliluca usingbigdatatechnologiesforhepanalysis AT dimakopoulosvasileios usingbigdatatechnologiesforhepanalysis AT elmerpeter usingbigdatatechnologiesforhepanalysis AT fiskian usingbigdatatechnologiesforhepanalysis AT gironemaria usingbigdatatechnologiesforhepanalysis AT gutscheoliver usingbigdatatechnologiesforhepanalysis AT hohsiewyan usingbigdatatechnologiesforhepanalysis AT jayatilakabo usingbigdatatechnologiesforhepanalysis AT khristenkoviktor usingbigdatatechnologiesforhepanalysis AT luiselliandrea usingbigdatatechnologiesforhepanalysis AT meloandrew usingbigdatatechnologiesforhepanalysis AT motesnitsalisevangelos usingbigdatatechnologiesforhepanalysis AT olivitodominick usingbigdatatechnologiesforhepanalysis AT pazzinijacopo usingbigdatatechnologiesforhepanalysis AT pivarskijim usingbigdatatechnologiesforhepanalysis AT svyatkovskiyalexey usingbigdatatechnologiesforhepanalysis AT zanettimarco usingbigdatatechnologiesforhepanalysis |