Distributed data analysis with ROOT RDataFrame

Widespread distributed processing of big datasets has been around for more than a decade now thanks to Hadoop, but only recently have higher-level abstractions been proposed for programmers to easily operate on those datasets, e.g. Spark. ROOT has joined that trend with its RDataFrame tool for decla...

Bibliographic Details
Main authors: Padulano, Vincenzo Eduardo; Villanueva, Javier Cervantes; Guiraud, Enrico; Tejedor Saavedra, Enric
Language: eng
Published: 2020
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.1051/epjconf/202024503009
http://cds.cern.ch/record/2753977
author Padulano, Vincenzo Eduardo
Villanueva, Javier Cervantes
Guiraud, Enrico
Tejedor Saavedra, Enric
collection CERN
description Widespread distributed processing of big datasets has been around for more than a decade now thanks to Hadoop, but only recently have higher-level abstractions been proposed for programmers to easily operate on those datasets, e.g. Spark. ROOT has joined that trend with its RDataFrame tool for declarative analysis, which currently supports local multi-threaded parallelisation. However, RDataFrame’s programming model is general enough to accommodate multiple implementations or backends: users could write their code once and execute it as-is locally or in a distributed fashion, just by selecting the corresponding backend. This abstract introduces PyRDF, a new Python library developed on top of RDataFrame to seamlessly switch from local to distributed environments with no changes in the application code. In addition, PyRDF has been integrated with a service for web-based analysis, SWAN, where users can dynamically plug in new resources, as well as write, execute, monitor and debug distributed applications via an intuitive interface.
id oai-inspirehep.net-1832081
institution European Organization for Nuclear Research
language eng
publishDate 2020
record_format invenio
title Distributed data analysis with ROOT RDataFrame
topic Computing and Computers