Utilizing GPU Performance Counters to Characterize GPU Kernels via Machine Learning

Bibliographic Details
Main Authors: Zigon, Bob; Song, Fengguang
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7302272/
http://dx.doi.org/10.1007/978-3-030-50371-0_7
Description
Summary: GPU computing kernels are relatively simple to write if achieving the best performance is not the highest priority. However, the task quickly becomes much more daunting when users try to tune and optimize their kernels to obtain the highest performance. This is due to GPUs’ massive degree of parallelism, complex memory hierarchy, fine-grained synchronization, and long memory access latency. Hence, users must carry out the complex tasks of profiling, analyzing, and tuning to reduce performance bottlenecks. Today’s GPUs can generate hundreds of performance events that comprehensively quantify the behavior of a kernel. Instead of relying on experts’ manual analysis, this paper uses machine learning methods to generalize GPU performance counter data and determine the characteristics of a GPU kernel, as these reveal possible reasons for low performance. We choose a set of problem-independent counters as our inputs to design and compare three machine learning methods that automatically classify the execution behavior of a kernel. The experimental results on stencil computing kernels and sparse matrix multiplications show that the machine learning models achieve good accuracy, and demonstrate a feasible approach that is capable of classifying a kernel’s characteristics and suggesting changes to a skilled user, who can subsequently improve kernel performance with less guessing.
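
The general idea in the summary, mapping a vector of performance counter values to a behavior class, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three counters, the class labels, and the numbers are hypothetical, and the nearest-centroid classifier is a simple stand-in, since the summary does not name the three machine learning methods the paper compares.

```python
from math import sqrt

# Hypothetical, normalized counter readings per kernel run:
# (achieved occupancy, DRAM utilization, branch efficiency).
# Labels and values are invented for illustration, not taken from the paper.
TRAINING = {
    "memory_bound": [(0.80, 0.95, 0.99), (0.75, 0.90, 0.97)],
    "low_occupancy": [(0.20, 0.40, 0.98), (0.25, 0.35, 0.96)],
    "divergent": [(0.70, 0.30, 0.50), (0.65, 0.35, 0.55)],
}


def centroid(rows):
    """Return the per-counter mean of a list of counter vectors."""
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows))


def train(data):
    """Build one centroid per behavior class."""
    return {label: centroid(rows) for label, rows in data.items()}


def distance(a, b):
    """Euclidean distance between two counter vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(model, sample):
    """Return the class whose centroid lies closest to the sample."""
    return min(model, key=lambda label: distance(model[label], sample))


model = train(TRAINING)
```

Calling `classify(model, (0.78, 0.92, 0.98))` returns the label of the nearest centroid, which a user could read as a hint about where to look first when tuning. A real pipeline would collect counters with a profiler and would need far more training runs than this toy table.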