
A quantitative review of data formats for HEP analyses

The analysis of High Energy Physics (HEP) data sets often takes place outside the realm of experiment frameworks and central computing workflows, using carefully selected “n-tuples” or Analysis Object Data (AOD) as a data source. Such n-tuples or AODs may comprise data from tens of millions of event...


Bibliographic Details
Main Author:	Blomer, J
Language:	eng
Published:	2018
Subjects:	Computing and Computers
Online Access:	https://dx.doi.org/10.1088/1742-6596/1085/3/032020 http://cds.cern.ch/record/2665776

collection	CERN
description	The analysis of High Energy Physics (HEP) data sets often takes place outside the realm of experiment frameworks and central computing workflows, using carefully selected “n-tuples” or Analysis Object Data (AOD) as a data source. Such n-tuples or AODs may comprise data from tens of millions of events and grow to a hundred gigabytes or a few terabytes in size. They are typically small enough to be processed by an institute’s cluster or even by a single workstation. N-tuples and AODs are often stored in the ROOT file format, in an array of serialized C++ objects in columnar storage layout. In recent years, several new data formats have emerged from the data analytics industry. We provide a quantitative comparison of ROOT and other popular data formats, such as Apache Parquet, Apache Avro, Google Protobuf, and HDF5. We compare speed, read patterns, and usage aspects for the use case of a typical LHC end-user n-tuple analysis. The performance characteristics of the relatively simple n-tuple data layout also provide a basis for understanding the performance of more complex and nested data layouts. From the benchmarks, we derive performance tuning suggestions both for the use of the data formats and for the ROOT (de-)serialization code.
id	oai-inspirehep.net-1699840
institution	Organización Europea para la Investigación Nuclear
