
Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to these challenges is presented, which allows training neural net...


Bibliographic Details
Main authors: Migliorini, Matteo; Castellotti, Riccardo; Canali, Luca; Zanetti, Marco
Language: eng
Published: 2019
Subjects: hep-ex (Particle Physics - Experiment); cs.LG (Computing and Computers); cs.DC (Computing and Computers)
Online access: http://cds.cern.ch/record/2692993
_version_ 1780963985558863872
author Migliorini, Matteo
Castellotti, Riccardo
Canali, Luca
Zanetti, Marco
author_facet Migliorini, Matteo
Castellotti, Riccardo
Canali, Luca
Zanetti, Marco
author_sort Migliorini, Matteo
collection CERN
description Using complex machine learning (ML) techniques effectively at scale for high energy physics (HEP) use cases poses several technological challenges, most importantly in implementing dedicated end-to-end data pipelines. A solution to these challenges is presented that allows training neural network classifiers using tools from the Big Data and data science ecosystems, integrated with tools, software, and platforms common in the HEP environment. In particular, Apache Spark is used for data preparation and feature engineering, with the corresponding Python code run interactively in Jupyter notebooks. Key integrations and libraries that enable Spark to ingest data stored in the ROOT format and accessed via the XRootD protocol are described and discussed. The neural network models, defined using the Keras API, are trained in a distributed fashion on Spark clusters using BigDL with Analytics Zoo, and also using TensorFlow, notably for distributed training on CPUs and GPUs. The implementation and results of the distributed training are described in detail in this work.
id cern-2692993
institution European Organization for Nuclear Research
language eng
publishDate 2019
record_format invenio
spelling cern-2692993 2023-03-21T13:20:43Z http://cds.cern.ch/record/2692993 eng Migliorini, Matteo Castellotti, Riccardo Canali, Luca Zanetti, Marco Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics hep-ex Particle Physics - Experiment cs.LG Computing and Computers cs.DC Computing and Computers The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly on the actual implementation of dedicated end-to-end data pipelines. A solution to these challenges is presented, which allows training neural network classifiers using solutions from the Big Data and data science ecosystems, integrated with tools, software, and platforms common in the HEP environment. In particular, Apache Spark is exploited for data preparation and feature engineering, running the corresponding (Python) code interactively on Jupyter notebooks. Key integrations and libraries that make Spark capable of ingesting data stored using ROOT format and accessed via the XRootD protocol, are described and discussed. Training of the neural network models, defined using the Keras API, is performed in a distributed fashion on Spark clusters by using BigDL with Analytics Zoo and also by using TensorFlow, notably for distributed training on CPU and using GPUs. The implementation and the results of the distributed training are described in detail in this work. arXiv:1909.10389 oai:cds.cern.ch:2692993 2019
spellingShingle hep-ex
Particle Physics - Experiment
cs.LG
Computing and Computers
cs.DC
Computing and Computers
Migliorini, Matteo
Castellotti, Riccardo
Canali, Luca
Zanetti, Marco
Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
title Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
title_full Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
title_fullStr Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
title_full_unstemmed Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
title_short Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
title_sort machine learning pipelines with modern big data tools for high energy physics
topic hep-ex
Particle Physics - Experiment
cs.LG
Computing and Computers
cs.DC
Computing and Computers
url http://cds.cern.ch/record/2692993
work_keys_str_mv AT migliorinimatteo machinelearningpipelineswithmodernbigdatatoolsforhighenergyphysics
AT castellottiriccardo machinelearningpipelineswithmodernbigdatatoolsforhighenergyphysics
AT canaliluca machinelearningpipelineswithmodernbigdatatoolsforhighenergyphysics
AT zanettimarco machinelearningpipelineswithmodernbigdatatoolsforhighenergyphysics