Leveraging HPC resources with distributed RDataFrame

Bibliographic Details
Main authors: Padulano, V E; Kabadzhov, I D; Saavedra, E T; Guiraud, E
Language: eng
Published: 2023
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.1088/1742-6596/2438/1/012097
http://cds.cern.ch/record/2871815
Collection: CERN
Record ID: cern-2871815
Institution: European Organization for Nuclear Research (CERN)
Record format: invenio
Description: The declarative approach to data analysis provides high-level abstractions for users to operate on their datasets in a much more ergonomic fashion compared to imperative interfaces. ROOT offers such a tool with RDataFrame, which has been tested in production environments and used in real-world analyses with optimal results. Its programming model acts by creating a computation graph with the operations issued by the user and executing it lazily only when the final results are queried. It has always been oriented towards parallelisation, with native support for multi-thread execution on a single machine. Recently, RDataFrame has been extended with a Python layer that is capable of steering and executing the RDataFrame computation graph over a set of distributed resources. In addition, such a layer requires minimal code changes for an RDataFrame application to run distributedly. The new tool effectively allows running a C++ event loop based on RDataFrame while leveraging common industry tools like Dask to schedule the usage of resources. This work presents results and insights gathered through the distributed RDataFrame tool running a physics analysis connecting multiple nodes with a Dask scheduler that requests resources from a Slurm cluster.
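
The programming model summarised in the description (operations recorded in a computation graph, executed lazily only when a result is queried, with the work optionally split across workers) can be illustrated with a toy sketch. This is plain Python and uses neither ROOT, RDataFrame, nor Dask; the class `LazyFrame`, its methods, and the `nworkers` parameter are hypothetical names invented for this illustration, and a thread pool stands in for distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyFrame:
    """Toy, hypothetical stand-in for a declarative, lazily evaluated dataframe."""

    def __init__(self, rows, ops=()):
        self._rows = list(rows)
        self._ops = tuple(ops)  # recorded operations: the "computation graph"

    def filter(self, predicate):
        # Only record the operation; no row is touched yet.
        return LazyFrame(self._rows, self._ops + (("filter", predicate),))

    def define(self, name, func):
        # Record a new column computed from each row; also lazy.
        return LazyFrame(self._rows, self._ops + (("define", (name, func)),))

    def _run_chunk(self, chunk):
        # Apply every recorded operation, in order, to one chunk of rows.
        out = []
        for row in chunk:
            row = dict(row)
            keep = True
            for kind, payload in self._ops:
                if kind == "filter":
                    if not payload(row):
                        keep = False
                        break
                else:
                    name, func = payload
                    row[name] = func(row)
            if keep:
                out.append(row)
        return out

    def sum(self, column, nworkers=1):
        # Querying a result triggers execution: split the rows into chunks,
        # process each chunk independently, then merge the partial sums.
        n = max(1, nworkers)
        chunks = [self._rows[i::n] for i in range(n)]

        def partial_sum(chunk):
            return sum(r[column] for r in self._run_chunk(chunk))

        with ThreadPoolExecutor(max_workers=n) as pool:
            return sum(pool.map(partial_sum, chunks))


calls = []  # records which rows the filter was actually evaluated on


def high_pt(row):
    calls.append(row["pt"])
    return row["pt"] > 4


events = [{"pt": float(i)} for i in range(10)]
df = LazyFrame(events).filter(high_pt).define("pt2", lambda r: r["pt"] ** 2)
calls_before_query = len(calls)  # still 0: building the graph runs nothing
total = df.sum("pt2", nworkers=3)
```

In this sketch the same graph gives the same result whether it runs on one worker or several; only the call that queries the result changes, which loosely mirrors the "minimal code changes" point made in the description.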