
Big Data tools as applied to ATLAS event data

Big Data technologies have proven to be very useful for storage, processing and visualization of derived metrics associated with ATLAS distributed computing (ADC) services. Logfiles, database records, and metadata from a diversity of systems have been aggregated and indexed to create an analytics pl...


Bibliographic Details
Main authors: Vukotic, Ilija; Gardner, Robert; Bryant, Lincoln
Language: eng
Published: 2017
Subjects: Particle Physics - Experiment
Online access: https://dx.doi.org/10.1088/1742-6596/898/7/072003
http://cds.cern.ch/record/2240088
_version_ 1780952994960900096
author Vukotic, Ilija
Gardner, Robert
Bryant, Lincoln
collection CERN
description Big Data technologies have proven very useful for the storage, processing, and visualization of derived metrics associated with ATLAS distributed computing (ADC) services. Log files, database records, and metadata from a variety of systems have been aggregated and indexed to create an analytics platform for ATLAS ADC operations analysis. Dashboards, wide-area data access cost metrics, user analysis patterns, and resource utilization efficiency charts are produced flexibly through queries against a powerful analytics cluster. Here we explore whether these techniques and the associated analytics ecosystem can be applied to add new modes of open, quick, and pervasive access to ATLAS event data. Such modes would simplify access and broaden the reach of ATLAS public data to new communities of users. The ability to efficiently store, filter, search, and deliver ATLAS data at the event and/or sub-event level in a widely supported format would enable, or significantly simplify, the use of machine learning environments and tools such as Spark, Jupyter, R, SciPy, Caffe, and TensorFlow. Machine learning challenges (such as the Higgs Boson Machine Learning Challenge and the Tracking challenge), event viewers (VP1, ATLANTIS, ATLASrift), and still-to-be-developed educational and outreach tools would be able to access the data through a simple REST API. In this preliminary investigation we focus on derived xAOD data sets. These are much smaller than the primary xAODs, as they contain only the containers, variables, and events of interest to a particular analysis. Encouraged by the performance of Elasticsearch for the ADC analytics platform, we developed an algorithm for indexing derived xAOD event data. We have made an appropriate document mapping and have imported a full set of Standard Model W/Z datasets. We compare the disk space efficiency of this approach to that of standard ROOT files, evaluate its performance in simple cut-flow data analysis, and present preliminary results on its scaling characteristics with different numbers of clients, query complexity, and size of the data retrieved.
id cern-2240088
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2017
record_format invenio
spelling cern-2240088 2019-10-15T15:19:09Z doi:10.1088/1742-6596/898/7/072003 http://cds.cern.ch/record/2240088 eng ATL-SOFT-PROC-2017-001 oai:cds.cern.ch:2240088 2017-01-03
title Big Data tools as applied to ATLAS event data
topic Particle Physics - Experiment
url https://dx.doi.org/10.1088/1742-6596/898/7/072003
http://cds.cern.ch/record/2240088
work_keys_str_mv AT vukoticilija bigdatatoolsasappliedtoatlaseventdata
AT gardnerrobert bigdatatoolsasappliedtoatlaseventdata
AT bryantlincoln bigdatatoolsasappliedtoatlaseventdata
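
The abstract mentions a document mapping for derived xAOD events and a simple cut-flow analysis run as queries against Elasticsearch. The record does not give the actual mapping or queries, so the following is only a minimal sketch under stated assumptions: every field name (`n_muons`, `lead_muon_pt`, `met`), the units, and the cut values are hypothetical. It builds plain request bodies in the standard Elasticsearch query DSL (`bool` / `filter` / `range`) and does not contact a cluster. The mapping uses the typeless layout of recent Elasticsearch versions, which may differ from the version the authors used.

```python
import json

# Hypothetical event-level fields; the record does not list the real mapping.
EVENT_MAPPING = {
    "mappings": {
        "properties": {
            "run_number": {"type": "integer"},
            "event_number": {"type": "long"},
            "n_muons": {"type": "integer"},
            "lead_muon_pt": {"type": "float"},  # GeV (assumed unit)
            "met": {"type": "float"},           # GeV (assumed unit)
        }
    }
}


def range_filter(field, gte=None, lte=None):
    """Build one Elasticsearch `range` clause for a numeric field."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"range": {field: bounds}}


def cut_flow_queries(cuts):
    """Return one query body per stage; stage i applies cuts 0..i cumulatively."""
    return [
        {"query": {"bool": {"filter": list(cuts[: i + 1])}}}
        for i in range(len(cuts))
    ]


# Illustrative cuts only; these are not the selections used by the authors.
CUTS = [
    range_filter("n_muons", gte=2),
    range_filter("lead_muon_pt", gte=25.0),
    range_filter("met", lte=40.0),
]

if __name__ == "__main__":
    for stage, body in enumerate(cut_flow_queries(CUTS), start=1):
        print(f"stage {stage}:", json.dumps(body))
```

Each body could be sent to an index's `_count` endpoint over the REST API to obtain the number of events surviving that stage. Placing the cuts in `filter` context skips relevance scoring, which fits a pure selection of this kind.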