Exploiting Apache Spark platform for CMS computing analytics

CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage, with various open-source, user-level tools available for analytical purposes. The CMS experiment has been collecting a large set of computing meta-data, e.g. dataset and file access logs, since 2015. These record...

Bibliographic Details
Main Authors:	Meoni, Marco; Kuznetsov, Valentin; Menichetti, Luca; Rumševičius, Justinas; Boccali, Tommaso; Bonacorsi, Daniele
Language:	eng
Published:	2017
Subjects:	physics.comp-ph; Other Fields of Physics; physics.data-an
Online Access:	https://dx.doi.org/10.1088/1742-6596/1085/3/032055 http://cds.cern.ch/record/2295120

collection	CERN
description	CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage, with various open-source, user-level tools available for analytical purposes. The CMS experiment has been collecting a large set of computing meta-data, e.g. dataset and file access logs, since 2015. These records represent a valuable, yet scarcely investigated, set of information that needs to be cleaned, categorized and analyzed. CMS can use this information to discover useful patterns and enhance the overall efficiency of the distributed data, and to improve CPU and site utilization as well as task completion time. Here we present an evaluation of the Apache Spark platform for CMS needs. We discuss two main use cases, CMS analytics and ML studies, where efficient processing of billions of records stored on HDFS plays an important role. We demonstrate that both the Scala and Python (PySpark) APIs can be successfully used to execute extremely I/O-intensive queries and provide valuable data insight from the collected meta-data.
id	cern-2295120
institution	European Organization for Nuclear Research
record_format	invenio
spelling	cern-2295120 2022-03-04T03:05:33Z doi:10.1088/1742-6596/1085/3/032055 http://cds.cern.ch/record/2295120 eng arXiv:1711.00552 oai:cds.cern.ch:2295120 2017-11-01
title	Exploiting Apache Spark platform for CMS computing analytics