A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex
This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each consisting of ~100 bytes and all equally likely to be searched or counted. Data f...
Main authors: | Baranowski, Zbigniew; Barberis, Dario; Canali, Luca; Hrivnac, Julius; Toebbicke, Rainer |
---|---|
Language: | eng |
Published: | 2016 |
Subjects: | Particle Physics - Experiment |
Online access: | http://cds.cern.ch/record/2217221 |
_version_ | 1780952089014304768 |
---|---|
author | Baranowski, Zbigniew Barberis, Dario Canali, Luca Hrivnac, Julius Toebbicke, Rainer |
author_facet | Baranowski, Zbigniew Barberis, Dario Canali, Luca Hrivnac, Julius Toebbicke, Rainer |
author_sort | Baranowski, Zbigniew |
collection | CERN |
description | This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each consisting of ~100 bytes and all equally likely to be searched or counted. Data formats are one important area for optimizing the performance and storage footprint of Hadoop-based applications. This work reports on production usage and on tests of several data formats, including Map Files, Apache Parquet, and Avro, together with various compression algorithms. The query engine also plays a critical role in the architecture. This paper reports on the use of HBase for the EventIndex, focussing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface and its optimizations for data warehouse workloads and reports. |
id | cern-2217221 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2016 |
record_format | invenio |
spelling | cern-2217221; 2019-09-30T06:29:59Z; http://cds.cern.ch/record/2217221; eng; Baranowski, Zbigniew; Barberis, Dario; Canali, Luca; Hrivnac, Julius; Toebbicke, Rainer; A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex; Particle Physics - Experiment; This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each consisting of ~100 bytes and all equally likely to be searched or counted. Data formats are one important area for optimizing the performance and storage footprint of Hadoop-based applications. This work reports on production usage and on tests of several data formats, including Map Files, Apache Parquet, and Avro, together with various compression algorithms. The query engine also plays a critical role in the architecture. This paper reports on the use of HBase for the EventIndex, focussing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface and its optimizations for data warehouse workloads and reports.; ATL-SOFT-SLIDE-2016-680; oai:cds.cern.ch:2217221; 2016-09-21 |
spellingShingle | Particle Physics - Experiment Baranowski, Zbigniew Barberis, Dario Canali, Luca Hrivnac, Julius Toebbicke, Rainer A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex |
title | A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex |
title_full | A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex |
title_fullStr | A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex |
title_full_unstemmed | A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex |
title_short | A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex |
title_sort | study of data representations in hadoop to optimize data storage and search performance of the atlas eventindex |
topic | Particle Physics - Experiment |
url | http://cds.cern.ch/record/2217221 |
work_keys_str_mv | AT baranowskizbigniew astudyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT barberisdario astudyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT canaliluca astudyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT hrivnacjulius astudyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT toebbickerainer astudyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT baranowskizbigniew studyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT barberisdario studyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT canaliluca studyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT hrivnacjulius studyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex AT toebbickerainer studyofdatarepresentationsinhadooptooptimizedatastorageandsearchperformanceoftheatlaseventindex |