Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS

Bibliographic Details
Main author: Mohamed, Abdulla
Language: eng
Published: 2020
Subjects:
Online access: http://cds.cern.ch/record/2725035
Description
Summary: Software engineers use parallel computing on General Purpose Graphics Processing Units (GPGPUs) to distribute the computing load across multiple processing units and meet the increasing demand for processing power. To get maximum performance from GPUs, researchers need to understand the architecture of modern GPUs, how to optimise their programs to maximise GPU utilisation, and how to measure the performance of GPU programs with performance profiling tools. This research measures the effectiveness of several GPU optimisation techniques through experiments on the Data Acquisition (DAQ) system used by the Compact Muon Solenoid (CMS) experiment. The techniques target memory access, control flow, and algorithmic optimisations. Multiple performance metrics, such as throughput and speedup, are used to compare the different GPU programs, and the benchmarking is done with different performance profiling tools. The results show that using GPU shared memory decreased the number of executed instructions and clock cycles by more than 4% and 12% respectively. Using a coalesced memory access pattern reduced the number of executed instructions and clock cycles by more than 71% and 44% respectively. However, using the Structure of Arrays (SoA) layout increased the number of executed instructions and clock cycles by less than 6% and 4% respectively. Furthermore, optimising the control flow by reducing the number of diverged threads in the GPU reduced the number of executed instructions and clock cycles by more than 57% and 68% respectively. As an algorithmic optimisation, a grid data structure was developed; it reduced the number of executed instructions and clock cycles by more than 98% and 95% respectively. Each of these results is relative to the previous optimisation iteration.
Combined, the optimisations gave a speedup of more than 13 times for the selected program compared with its CPU performance.
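
Because each percentage in the summary is stated relative to the previous optimisation iteration, the changes compose multiplicatively rather than additively. The sketch below is only an illustration of that arithmetic, not a calculation taken from the thesis. It assumes the quoted bounds ("more than 12%", "less than 4%", and so on) can be treated as exact values, and that the first step is measured against an unoptimised GPU version. The names `cycle_changes`, `remaining_fraction`, and `speedup` are hypothetical.

```python
from math import prod

# Relative change in clock cycles for each optimisation step.
# ASSUMPTION: the bounds quoted in the summary are used as if they were exact.
# Negative = reduction, positive = increase, each relative to the previous step.
cycle_changes = {
    "shared memory": -0.12,
    "coalesced access": -0.44,
    "structure of arrays": +0.04,
    "reduced divergence": -0.68,
    "grid data structure": -0.95,
}


def remaining_fraction(changes):
    """Fraction of the starting cost left after applying each relative change in turn."""
    return prod(1.0 + c for c in changes)


def speedup(baseline_time, optimised_time):
    """Speedup defined as baseline run time divided by optimised run time."""
    return baseline_time / optimised_time


if __name__ == "__main__":
    frac = remaining_fraction(cycle_changes.values())
    print(f"clock cycles remaining: {frac:.4%} of the starting GPU version")
```

The resulting fraction describes GPU clock cycles relative to the first GPU version under the assumptions above. It is not directly comparable with the reported speedup of more than 13 times, which is measured against CPU performance.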