A quantitative review of data formats for HEP analyses

The analysis of High Energy Physics (HEP) data sets often takes place outside the realm of experiment frameworks and central computing workflows, using carefully selected “n-tuples” or Analysis Object Data (AOD) as a data source. Such n-tuples or AODs may comprise data from tens of millions of event...

Bibliographic Details
Main author: Blomer, J
Language: eng
Published: 2018
Subjects: Computing and Computers
Online access: https://dx.doi.org/10.1088/1742-6596/1085/3/032020
http://cds.cern.ch/record/2665776
Collection: CERN
Description: The analysis of High Energy Physics (HEP) data sets often takes place outside the realm of experiment frameworks and central computing workflows, using carefully selected “n-tuples” or Analysis Object Data (AOD) as a data source. Such n-tuples or AODs may comprise data from tens of millions of events and grow to hundreds of gigabytes or a few terabytes in size. They are typically small enough to be processed by an institute’s cluster or even by a single workstation. N-tuples and AODs are often stored in the ROOT file format, in an array of serialized C++ objects in columnar storage layout. In recent years, several new data formats have emerged from the data analytics industry. We provide a quantitative comparison of ROOT and other popular data formats, such as Apache Parquet, Apache Avro, Google Protobuf, and HDF5. We compare speed, read patterns, and usage aspects for the use case of a typical LHC end-user n-tuple analysis. The performance characteristics of the relatively simple n-tuple data layout also provide a basis for understanding the performance of more complex and nested data layouts. From the benchmarks, we derive performance tuning suggestions both for the use of the data formats and for the ROOT (de-)serialization code.
Record ID: oai-inspirehep.net-1699840
Institution: European Organization for Nuclear Research (CERN)