Minimizing CPU Utilization Requirements to Monitor an ATLAS Data Transfer System

The ATLAS[1] experiment at the LHC will use a PC-based read-out component called FELIX[2] to connect its Front-End Electronics to the Data Acquisition System. FELIX translates proprietary Front-End protocols to Ethernet and vice versa. Currently, FELIX makes use of parallel multi-threading to achieve the da...

Bibliographic Details
Main Authors: Leventis, Georgios, Schumacher, Jorn, Donszelmann, Mark
Language: eng
Published: 2019
Subjects: Particle Physics - Experiment
Online Access: http://cds.cern.ch/record/2686197
author Leventis, Georgios
Schumacher, Jorn
Donszelmann, Mark
collection CERN
description The ATLAS[1] experiment at the LHC will use a PC-based read-out component called FELIX[2] to connect its Front-End Electronics to the Data Acquisition System. FELIX translates proprietary Front-End protocols to Ethernet and vice versa. Currently, FELIX makes use of parallel multi-threading to achieve the data rate requirements. As a non-redundant component of the critical infrastructure, FELIX must be monitored. Monitoring includes, but is not limited to, package statistics, memory utilization, and data rate statistics. However, for these statistics to be of practical use, the parallel threads are required to intercommunicate. The FELIX monitoring implementation prior to this research used thread-safe queues to which data was pushed from the parallel threads. A central thread would extract and combine the queue contents. Enabling statistics would deteriorate the throughput rate by more than 500%. To minimize this performance hit, we took advantage of the CPU's micro-architecture. The focus was on hardware-supported atomic operations. They are usually implemented with a load-link/store-conditional pair of instructions. These instructions guarantee that a value is only modified if no updates have occurred on that value since it was read. They are used to complement and/or replace the lock mechanisms of parallel computing. The aforementioned queue system was replaced with sets of C/C++ atomic variables and corresponding atomic functions, hereinafter referred to as atomics. Three implementations were measured. Implementation A had one set of atomic variables accessed from all the parallel threads. Implementation B had a set of atomic variables for every thread. These sets were accumulated by a central thread. Implementation C was the same as implementation B, but measures were taken to eliminate cache invalidation effects.
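The guarantee described above (a value is only modified if nothing has updated it since it was read) can be illustrated with a compare-and-exchange retry loop on a C++ atomic. This is a minimal sketch, not the FELIX code: the helper `atomic_store_max` and the demo `run_max_demo` are hypothetical names introduced here for illustration. On architectures that provide load-link/store-conditional, compilers typically lower `compare_exchange_weak` to such an instruction pair; other architectures use a single compare-and-swap instruction instead.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: raise `target` to `candidate` if `candidate` is larger.
// compare_exchange_weak stores `candidate` only if `target` still holds `seen`;
// otherwise it writes the current value into `seen` and returns false.
inline void atomic_store_max(std::atomic<std::uint64_t>& target,
                             std::uint64_t candidate) {
    std::uint64_t seen = target.load(std::memory_order_relaxed);
    while (seen < candidate &&
           !target.compare_exchange_weak(seen, candidate,
                                         std::memory_order_relaxed)) {
        // Another thread updated `target` (or the exchange failed spuriously).
        // `seen` now holds the fresh value, so the loop condition re-checks it.
    }
}

// Hypothetical demo: several threads race to record the largest value seen.
inline std::uint64_t run_max_demo(unsigned n_threads, std::uint64_t per_thread) {
    std::atomic<std::uint64_t> max_seen{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&max_seen, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                atomic_store_max(max_seen, t * per_thread + i);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return max_seen.load();
}
```

No lock is taken at any point: a thread that loses the race simply retries with the value it just observed.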
The compiler used during the measurements was GCC, which partially supports the hardware (micro-architecture) optimizations for atomics. Implementations A and B resulted in negligible differences compared to the initial implementation: the gains were inconsistent and below 5%, and some benchmarks even showed a deterioration in performance. Implementation C (cache-optimized) yielded a performance improvement of up to 625% compared to the initial implementation. The data rate target was reached. Implementations similar to C could benefit similar environments. The results presented exhibit the power of programming based on atomics. However, the results also make clear that the system architecture and cache hierarchy need to be taken into account in this programming model. The paper details the challenges of atomics and how they were overcome in the implementation of the FELIX monitoring system.
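The abstract does not say which measures Implementation C took. One common way to avoid cache invalidation between per-thread counters is to pad each thread's set to its own cache line, so that a write by one thread does not invalidate a line that another thread is using (false sharing). The following is a minimal sketch under that assumption, not the authors' implementation. The names `ThreadStats`, `Totals`, `collect` and `run_stats_demo`, the thread count, and the 64-byte line size are all hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical on x86-64; check the target CPU).
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kThreads = 4;  // hypothetical number of worker threads

// Hypothetical per-thread statistics set. alignas places every instance on
// its own cache line, so neighbouring sets never share a line.
struct alignas(kCacheLine) ThreadStats {
    std::atomic<std::uint64_t> packets{0};
    std::atomic<std::uint64_t> bytes{0};
};
static_assert(sizeof(ThreadStats) % kCacheLine == 0,
              "each set must occupy whole cache lines");

struct Totals {
    std::uint64_t packets;
    std::uint64_t bytes;
};

// Central accumulation step: sum the per-thread sets.
inline Totals collect(const std::array<ThreadStats, kThreads>& stats) {
    Totals sum{0, 0};
    for (const ThreadStats& s : stats) {
        sum.packets += s.packets.load(std::memory_order_relaxed);
        sum.bytes += s.bytes.load(std::memory_order_relaxed);
    }
    return sum;
}

// Hypothetical demo: each worker updates only its own set.
inline Totals run_stats_demo(std::uint64_t n_packets, std::uint64_t packet_size) {
    std::array<ThreadStats, kThreads> stats;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < kThreads; ++t) {
        ThreadStats* mine = &stats[t];
        workers.emplace_back([mine, n_packets, packet_size] {
            for (std::uint64_t i = 0; i < n_packets; ++i) {
                mine->packets.fetch_add(1, std::memory_order_relaxed);
                mine->bytes.fetch_add(packet_size, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return collect(stats);
}
```

Removing the `alignas` gives a layout in the spirit of Implementation B: the counters stay per-thread, but adjacent sets can land on the same cache line. In a live monitor the central thread would call `collect` while the workers are still running, in which case the relaxed loads yield an approximate snapshot rather than an exact one.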
id cern-2686197
institution European Organization for Nuclear Research
language eng
publishDate 2019
record_format invenio
spelling cern-2686197 2020-03-24T14:37:14Z http://cds.cern.ch/record/2686197 eng Leventis, Georgios; Schumacher, Jorn; Donszelmann, Mark. Minimizing CPU Utilization Requirements to Monitor an ATLAS Data Transfer System. Particle Physics - Experiment. ATL-DAQ-SLIDE-2019-520 oai:cds.cern.ch:2686197 2019-08-09
title Minimizing CPU Utilization Requirements to Monitor an ATLAS Data Transfer System
topic Particle Physics - Experiment
url http://cds.cern.ch/record/2686197