Large-scale software optimization and micro-architectural specialization for accelerated high-performance computing
Main Author:
Language: eng
Published: 2022
Subjects:
Online Access: http://cds.cern.ch/record/2842047
Summary:

For almost half a century, the performance of processors has been improving exponentially, closely following the observations made by Gordon Moore and Robert Dennard. One after the other, both predictions have come to a halt due to the increased complexity of transistor manufacturing, the power and thermal limitations at extremely small technology nodes, and the implications of Amdahl's law for multiprocessing. Nowadays, more than ever, the path to high performance passes through meticulous software optimization, fine tuning, and the design of domain-specific hardware architectures and accelerators. Within the scope of this thesis, we approach High-Performance Computing from two different standpoints.

In the first part of the thesis, we bridge the gap between productivity-oriented, high-level programming languages and high-performance computing techniques. The domain of focus is particle accelerator physics, and more specifically beam dynamics. The state-of-the-art Beam Longitudinal Dynamics simulator BLonD was developed at CERN in 2014, and since then, BLonD has been driving the baseline choices for key parameters related to the daily operation of the largest circular particle accelerators and their upgrades, as well as the research for future machines. We develop a single-node, optimized, multi-threaded version of BLonD to accommodate simulation studies oriented towards design-space exploration. Then, we build a hybrid, MPI-over-OpenMP version to bring the run-time of previously week-long or even month-long simulations down to a few hours. To achieve that, techniques such as intelligent dynamic load balancing and approximate computing were employed. Finally, to anticipate the demand for ever-growing simulation workloads, we design a distributed, GPU-accelerated version of the code, which delivers more than two orders of magnitude better latency and throughput than the previous state of the art (a sketch of this distributed pattern is given at the end of this summary). All of the above technologies and optimizations are developed in a user-friendly way. The dramatic reduction in execution time enables scientists to simulate beam longitudinal dynamics scenarios that combine more complex physics phenomena with finer resolution and a larger number of simulated particles. These complex, accurate and fast simulations are essential in the field of beam dynamics to overcome current technological limitations, plan the upcoming upgrades of particle accelerators, and design future machines that will help science advance further.

The second part of the thesis is focused on hardware customization to accommodate the needs of modern applications. GPUs, once used for the acceleration of graphics workloads, have now become the dominant platform for general-purpose application acceleration. Their processing power and cost-efficiency have led to their adoption in almost every computing domain, including machine learning, scientific computing, and databases. By monitoring the behavior of multiple GPU-accelerated workloads, we identified a significant class of kernels that, due to limited data parallelism, fail to sustain a high degree of Thread-Level Parallelism and hide the latency of memory operations. These kernels call for more aggressive Instruction-Level Parallelism strategies to improve stall hiding and fill the execution pipeline. This inefficiency is addressed by designing a novel, lightweight Out-Of-Order GPU (LOOG) micro-architecture. LOOG is designed to reuse and repurpose existing hardware components to minimize the power and area overheads.
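As an illustration of the kernel behaviour that motivates LOOG (a hypothetical sketch, not code from the thesis or from LOOG itself), the CUDA kernel below is launched with far too few warps for Thread-Level Parallelism alone to hide memory latency; each thread therefore keeps four independent partial sums, and the resulting independent loads are exactly the kind of Instruction-Level Parallelism that a more aggressive, out-of-order issue policy can keep in flight. All names and sizes are invented for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void strided_sum_ilp4(const float* __restrict__ in,
                                 float* __restrict__ out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Four independent accumulators: the four loads in each iteration have
    // no dependences on one another, so they can be in flight concurrently
    // (Instruction-Level Parallelism) even when few warps are resident.
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (int i = tid; i + 3 * stride < n; i += 4 * stride) {
        s0 += in[i];
        s1 += in[i + stride];
        s2 += in[i + 2 * stride];
        s3 += in[i + 3 * stride];
    }
    atomicAdd(out, s0 + s1 + s2 + s3);
}

int main()
{
    const int n = 1 << 22;          // chosen as a multiple of 4 * grid size
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    // Deliberately tiny grid: far too few warps for Thread-Level
    // Parallelism alone to hide DRAM latency.
    strided_sum_ilp4<<<4, 128>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *out, n);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

At such low occupancy, a multi-accumulator variant like this typically sustains noticeably higher memory throughput than a single-accumulator loop; LOOG aims to recover that kind of gap in hardware rather than requiring it to be expressed in the source code.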
By exploiting Instruction-Level Parallelism to complement the existing Thread-Level Parallelism execution model, LOOG surpasses both traditional GPU platforms and other prior-art policies. The thesis provides a thorough discussion of LOOG's internals and the key design trade-offs that had to be considered. Moreover, an extensive design-space exploration is performed to fine-tune LOOG and demonstrate its effectiveness when applied on top of a variety of GPU platforms. The LOOG mechanism outperforms conventional platforms by 27.6% and 22.4% in terms of run-time and energy efficiency, respectively. This is a strong indication that LOOG is a promising alternative GPU micro-architecture, capable of expanding the applicability of future GPU platforms even further, to new application domains.

To summarize, this thesis proposes two approaches to improving performance, in terms of execution time and energy efficiency, that anticipate the ever-increasing computing requirements of modern applications. Firstly, we discuss meticulous software customization to take advantage of existing multi-processors and hardware accelerators, while providing an easy-to-use interface to the user base. Secondly, we explore micro-architectural specializations that adjust to the needs of modern workloads.
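The distributed, GPU-accelerated tracking pattern referred to in the first part of the summary can be pictured roughly as follows. This is a hypothetical MPI-plus-CUDA sketch, not the actual BLonD code: particles are split evenly across ranks, each rank tracks its slice on a local GPU with a toy kick-and-drift kernel, and a per-turn all-reduce stands in for the collective quantities (such as the beam profile) that the real simulator exchanges. All kernel, variable, and file names are invented.

```cuda
// One possible build: nvcc -ccbin=mpicxx track_sketch.cu -o track_sketch
#include <mpi.h>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

// Toy per-particle update; stands in for the real longitudinal physics.
__global__ void track(double* dE, double* dt, int n,
                      double voltage, double slippage)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dE[i] += voltage * sin(dt[i]);   // toy energy kick
        dt[i] += slippage * dE[i];       // toy phase drift
    }
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind each rank to a local GPU (round-robin if fewer GPUs than ranks).
    int n_dev = 1;
    cudaGetDeviceCount(&n_dev);
    cudaSetDevice(rank % n_dev);

    // Each rank owns an equal slice of the beam (hypothetical sizes).
    const int n_total = 1 << 20;
    const int n_local = n_total / size;

    // Hypothetical initial phase spread, copied once to the local GPU.
    std::vector<double> h_dt(n_local);
    for (int i = 0; i < n_local; ++i)
        h_dt[i] = -3.14 + 6.28 * i / n_local;

    double *dE, *dt;
    cudaMalloc(&dE, n_local * sizeof(double));
    cudaMalloc(&dt, n_local * sizeof(double));
    cudaMemset(dE, 0, n_local * sizeof(double));
    cudaMemcpy(dt, h_dt.data(), n_local * sizeof(double),
               cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n_local + threads - 1) / threads;

    for (int turn = 0; turn < 100; ++turn) {
        // Local tracking on this rank's slice.
        track<<<blocks, threads>>>(dE, dt, n_local, 1e-3, 1e-4);

        // Stand-in for the per-turn collective: reduce a scalar on the GPU,
        // then combine it across ranks.
        double local  = thrust::reduce(thrust::device, dE, dE + n_local, 0.0);
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        if (rank == 0 && turn == 99)
            printf("mean dE after %d turns: %e\n", turn + 1,
                   global / n_total);
    }

    cudaFree(dE);
    cudaFree(dt);
    MPI_Finalize();
    return 0;
}
```

The dynamic load balancing and approximate computing mentioned in the summary would sit on top of a pattern like this (for example, by resizing each rank's slice between turns), but they are omitted here for brevity.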