
A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex

This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each consisting of ~100 bytes, all equally likely to be searched or counted. Data f...


Bibliographic Details
Main authors: Baranowski, Zbigniew; Barberis, Dario; Canali, Luca; Hrivnac, Julius; Toebbicke, Rainer
Language: eng
Published: 2016
Subjects: Particle Physics - Experiment
Online access: http://cds.cern.ch/record/2217221
author Baranowski, Zbigniew
Barberis, Dario
Canali, Luca
Hrivnac, Julius
Toebbicke, Rainer
collection CERN
description This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each consisting of ~100 bytes, all equally likely to be searched or counted. Data formats are one important area for optimizing the performance and storage footprint of applications based on Hadoop. This work reports on production usage and on tests of several data formats, including Map Files, Apache Parquet, and Avro, together with various compression algorithms. The query engine also plays a critical role in the architecture. This paper reports on the use of HBase for the EventIndex, focussing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface and its optimizations for data warehouse workloads and reports.
id cern-2217221
institution European Organization for Nuclear Research (CERN)
language eng
publishDate 2016
record_format invenio
spelling cern-2217221 2019-09-30T06:29:59Z http://cds.cern.ch/record/2217221 eng ATL-SOFT-SLIDE-2016-680 oai:cds.cern.ch:2217221 2016-09-21
title A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex
topic Particle Physics - Experiment
url http://cds.cern.ch/record/2217221