
Big Data in HEP: A comprehensive use case study

Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems collectively called Big Data technologies have emerg...


Bibliographic Details
Main authors: Gutsche, Oliver; Cremonesi, Matteo; Elmer, Peter; Jayatilaka, Bo; Kowalkowski, Jim; Pivarski, Jim; Sehrish, Saba; Suárez, Cristina Mantilla; Svyatkovskiy, Alexey; Tran, Nhan
Language: eng
Published: 2017
Subjects: cs.DC; Computing and Computers
Online access: https://dx.doi.org/10.1088/1742-6596/898/7/072012
http://cds.cern.ch/record/2293843
_version_ 1780956589563314176
author Gutsche, Oliver
Cremonesi, Matteo
Elmer, Peter
Jayatilaka, Bo
Kowalkowski, Jim
Pivarski, Jim
Sehrish, Saba
Suárez, Cristina Mantilla
Svyatkovskiy, Alexey
Tran, Nhan
author_facet Gutsche, Oliver
Cremonesi, Matteo
Elmer, Peter
Jayatilaka, Bo
Kowalkowski, Jim
Pivarski, Jim
Sehrish, Saba
Suárez, Cristina Mantilla
Svyatkovskiy, Alexey
Tran, Nhan
author_sort Gutsche, Oliver
collection CERN
description Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems collectively called Big Data technologies have emerged to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and promise a fresh look at analysis of very large datasets and could potentially reduce the time-to-physics with increased interactivity. In this talk, we present an active LHC Run 2 analysis, searching for dark matter with the CMS detector, as a testbed for Big Data technologies. We directly compare the traditional NTuple-based analysis with an equivalent analysis using Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the analysis with the official experiment data formats and produce publication physics plots. We will discuss advantages and disadvantages of each approach and give an outlook on further studies needed.
id cern-2293843
institution Organización Europea para la Investigación Nuclear
language eng
publishDate 2017
record_format invenio
spelling cern-2293843 2023-05-17T03:48:46Z doi:10.1088/1742-6596/898/7/072012 http://cds.cern.ch/record/2293843 eng Gutsche, Oliver; Cremonesi, Matteo; Elmer, Peter; Jayatilaka, Bo; Kowalkowski, Jim; Pivarski, Jim; Sehrish, Saba; Suárez, Cristina Mantilla; Svyatkovskiy, Alexey; Tran, Nhan Big Data in HEP: A comprehensive use case study cs.DC Computing and Computers Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems collectively called Big Data technologies have emerged to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and promise a fresh look at analysis of very large datasets and could potentially reduce the time-to-physics with increased interactivity. In this talk, we present an active LHC Run 2 analysis, searching for dark matter with the CMS detector, as a testbed for Big Data technologies. We directly compare the traditional NTuple-based analysis with an equivalent analysis using Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the analysis with the official experiment data formats and produce publication physics plots. We will discuss advantages and disadvantages of each approach and give an outlook on further studies needed. arXiv:1703.04171 FERMILAB-CONF-17-028-CD oai:cds.cern.ch:2293843 2017-03-12
spellingShingle cs.DC
Computing and Computers
Gutsche, Oliver
Cremonesi, Matteo
Elmer, Peter
Jayatilaka, Bo
Kowalkowski, Jim
Pivarski, Jim
Sehrish, Saba
Suárez, Cristina Mantilla
Svyatkovskiy, Alexey
Tran, Nhan
Big Data in HEP: A comprehensive use case study
title Big Data in HEP: A comprehensive use case study
title_full Big Data in HEP: A comprehensive use case study
title_fullStr Big Data in HEP: A comprehensive use case study
title_full_unstemmed Big Data in HEP: A comprehensive use case study
title_short Big Data in HEP: A comprehensive use case study
title_sort big data in hep: a comprehensive use case study
topic cs.DC
Computing and Computers
url https://dx.doi.org/10.1088/1742-6596/898/7/072012
http://cds.cern.ch/record/2293843
work_keys_str_mv AT gutscheoliver bigdatainhepacomprehensiveusecasestudy
AT cremonesimatteo bigdatainhepacomprehensiveusecasestudy
AT elmerpeter bigdatainhepacomprehensiveusecasestudy
AT jayatilakabo bigdatainhepacomprehensiveusecasestudy
AT kowalkowskijim bigdatainhepacomprehensiveusecasestudy
AT pivarskijim bigdatainhepacomprehensiveusecasestudy
AT sehrishsaba bigdatainhepacomprehensiveusecasestudy
AT surezcristinamantilla bigdatainhepacomprehensiveusecasestudy
AT svyatkovskiyalexey bigdatainhepacomprehensiveusecasestudy
AT trannhan bigdatainhepacomprehensiveusecasestudy