
Preparing the CERN machine-learned particle-flow model for Exascale using Horovod: Experience and performance studies on the Flatiron and Jülich supercomputers


Bibliographic Details
Main author: Sørlie, Lars
Language: English
Published: NTNU 2022
Subjects:
Online access: http://cds.cern.ch/record/2839735
Description
Summary: There is increasing interest in demonstrating the importance of HPC for Artificial Intelligence and of Artificial Intelligence for HPC. HPC centers are boasting ever larger compute power, with more centers reaching exascale. CERN is investigating the use of Artificial Intelligence and Machine Learning to augment or replace some of the traditional workflows within the LHC experiments. An advantage of Machine Learning and Artificial Intelligence is their highly parallelizable nature on suitable hardware, such as GPUs.

MLPF, like every other large-scale model, requires large compute resources in order to become efficient and accurate. Models and their datasets keep increasing in size, further expanding their need for compute resources. The work in this thesis includes implementing a distributed version of the Graph Neural Network MLPF using the Horovod framework, with the aim of scaling the application to exascale-class supercomputers. Horovod is a well-established framework for distributed workloads within the Artificial Intelligence field.

Our work uses the Horovod framework to distribute the training across up to 292 supercomputer nodes with up to 4 GPUs each, i.e. runs with over 1100 GPUs on the Jülich Juwels supercomputer. Our experiments focus on the Nvidia Volta and Ampere architectures, as these were the best available to us during this thesis. During these scaling tests we observe that the performance scales well up to 24 nodes: we see a speedup of up to 20X on 24 nodes, whereas on 100 nodes the speedup reaches only 50X, i.e. the scaling falls well below linear.

The thesis also includes comparisons of scaling performance between the Volta and Ampere GPU architectures. Our results show that the difference between using Volta and Ampere GPUs diminishes as one scales to many nodes: already at 8 nodes the difference is only 36 seconds per epoch, whereas the single-node epoch times are 485 seconds and 200 seconds for a single Volta and a single Ampere GPU, respectively. Some suggestions for future work are also included.
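
For readers unfamiliar with Horovod, the following is a minimal sketch of the kind of data-parallel training setup described in the summary. It is not the thesis code: the tiny dense model and the random dataset are stand-ins for the actual MLPF graph neural network and the particle-flow data, and the TensorFlow/Keras framework choice and optimizer settings are illustrative assumptions. Only the Horovod calls themselves (hvd.init, hvd.DistributedOptimizer, the broadcast callback) follow the standard public API.

    # Minimal, hedged sketch of Horovod data-parallel training; NOT the thesis code.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one process per GPU, launched e.g. with srun or horovodrun

    # Pin each worker process to a single local GPU (e.g. 4 per Juwels node).
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    # Stand-in model and data; the real MLPF model and dataset differ.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(1),
    ])
    x = np.random.rand(1024, 16).astype("float32")
    y = np.random.rand(1024, 1).astype("float32")
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).shard(
        hvd.size(), hvd.rank()).batch(32)  # each rank trains on its own shard

    # Scale the learning rate with the number of workers and wrap the optimizer
    # so gradients are averaged across all GPUs via allreduce.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
    model.compile(optimizer=opt, loss="mse")

    callbacks = [
        # Broadcast the initial weights from rank 0 so all workers start equal.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]
    model.fit(dataset, epochs=3, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)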
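
As a back-of-the-envelope reading of the scaling numbers quoted above (assuming the speedups are measured per node against a single-node baseline, which the summary does not state explicitly), parallel efficiency is simply the achieved speedup divided by the ideal linear speedup:

    # Rough sanity check of the reported scaling figures (assumed node-level speedups).
    def parallel_efficiency(speedup: float, nodes: int) -> float:
        """Achieved speedup divided by the ideal linear speedup."""
        return speedup / nodes

    print(parallel_efficiency(20, 24))   # ~0.83: scales well up to 24 nodes
    print(parallel_efficiency(50, 100))  # 0.50: efficiency drops at 100 nodes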