
Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS

Software engineers have been utilising parallel computing on General Purpose Graphics Processing Units (GPGPUs) to distribute the computing load across multiple processing units and meet the increasing demand for processing power. To get maximum performance from GPUs, researchers need to un...


Bibliographic Details
Main author: Mohamed, Abdulla
Language: eng
Published: 2020
Subjects: Computing and Computers; Detectors and Experimental Techniques
Online access: http://cds.cern.ch/record/2725035
_version_ 1780965995450466304
author Mohamed, Abdulla
author_facet Mohamed, Abdulla
author_sort Mohamed, Abdulla
collection CERN
description Software engineers have been utilising parallel computing on General Purpose Graphics Processing Units (GPGPUs) to distribute the computing load across multiple processing units and meet the increasing demand for processing power. To get maximum performance from GPUs, researchers need to understand the architecture of modern GPUs, how to optimise their programs to maximise GPU utilisation, and how to measure the performance of GPU programs with performance profiling tools. This research measures the effectiveness of several GPU optimisation techniques through experiments on the Data Acquisition (DAQ) system used by the Compact Muon Solenoid (CMS) experiment. The techniques target memory access, control flow, and algorithmic optimisations. Multiple performance metrics, such as throughput and speedup, are used to compare the different GPU programs, and the benchmarking is done with different performance profiling tools. The results show that using GPU shared memory decreased the number of executed instructions and clock cycles by more than 4% and 12% respectively. Using a coalesced memory access pattern reduced the number of executed instructions and clock cycles by more than 71% and 44% respectively. However, using the Structure of Arrays (SoA) layout increased the number of executed instructions and clock cycles by less than 6% and 4% respectively. Furthermore, optimising the control flow by reducing the number of diverged threads in the GPU reduced the number of executed instructions and clock cycles by more than 57% and 68% respectively. As an algorithmic optimisation, a grid data structure was developed, which reduced the number of executed instructions and clock cycles by more than 98% and 95% respectively. Each of these results is relative to the previous optimisation iteration. All the optimisations combined resulted in a speedup of more than 13 times for the selected program compared to the CPU performance.
id cern-2725035
institution European Organization for Nuclear Research
language eng
publishDate 2020
record_format invenio
spelling cern-2725035
2020-09-28T09:31:01Z
http://cds.cern.ch/record/2725035
eng
Mohamed, Abdulla
Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS
Computing and Computers
Detectors and Experimental Techniques
CERN-THESIS-2020-078
oai:cds.cern.ch:2725035
2020-07-27T09:31:59Z
spellingShingle Computing and Computers
Detectors and Experimental Techniques
Mohamed, Abdulla
Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS
title Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS
title_full Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS
title_fullStr Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS
title_full_unstemmed Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS
title_short Enhancing CMS DAQ Systems Performance using Performance Profiling of Parallel Programs on GPGPUS
title_sort enhancing cms daq systems performance using performance profiling of parallel programs on gpgpus
topic Computing and Computers
Detectors and Experimental Techniques
url http://cds.cern.ch/record/2725035
work_keys_str_mv AT mohamedabdulla enhancingcmsdaqsystemsperformanceusingperformanceprofilingofparallelprogramsongpgpus
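
The abstract in this record reports each optimisation's effect relative to the previous iteration, so the individual percentages do not add up directly. The sketch below shows how such step-wise bounds could be compounded for the clock-cycle figures. It assumes the optimisations were applied in the order the abstract lists them and that the changes multiply; neither assumption is confirmed by the record. The names `CYCLE_CHANGE_BOUNDS` and `remaining_fraction` are hypothetical, introduced only for illustration.

```python
from math import prod

# Bounds on the clock-cycle change per optimisation step, as quoted in the abstract.
# Negative values are reductions ("more than"); positive values are increases ("less than").
# Assumption: steps were applied in this order, each measured against the previous one.
CYCLE_CHANGE_BOUNDS = {
    "shared memory": -0.12,
    "coalesced memory access": -0.44,
    "Structure of Arrays (SoA)": +0.04,
    "reduced thread divergence": -0.68,
    "grid data structure": -0.95,
}


def remaining_fraction(changes):
    """Multiply the per-step factors (1 + change) to get the fraction of cycles left."""
    return prod(1.0 + change for change in changes)


# Upper bound on the cycles left relative to the version before the first listed step.
fraction = remaining_fraction(CYCLE_CHANGE_BOUNDS.values())
print(f"clock cycles remaining: under {fraction:.2%} of the starting version")
```

Under these assumptions the product is only an upper bound on the remaining clock cycles, because every quoted figure is itself a bound. It is also not comparable to the reported 13 times speedup, which is measured against the CPU rather than against an earlier GPU iteration.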