Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)(TM) Streaming-Aggregation Hardware Design and Evaluation

Bibliographic Details
Main Authors: Graham, Richard L., Levi, Lion, Burredy, Devendar, Bloch, Gil, Shainer, Gilad, Cho, David, Elias, George, Klein, Daniel, Ladd, Joshua, Maor, Ophir, Marelli, Ami, Petrov, Valentin, Romlet, Evyatar, Qin, Yong, Zemah, Ido
Format: Online Article Text
Language: English
Published: 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295336/
http://dx.doi.org/10.1007/978-3-030-50743-5_3
Description
Summary: This paper describes the new hardware-based streaming-aggregation capability added to Mellanox’s Scalable Hierarchical Aggregation and Reduction Protocol in its HDR InfiniBand switches. For large messages, this capability is designed to achieve reduction bandwidths similar to those of point-to-point messages of the same size, and it complements the existing low-latency aggregation capability, which targets small data reductions. MPI_Allreduce() bandwidth measured on an HDR InfiniBand-based system achieves about 95% of network bandwidth. For medium and large data reductions this also improves the reduction bandwidth by a factor of 2–5 relative to host-based (i.e., software-based) reduction algorithms. Using this capability also increased DL-Poly and PyTorch application performance by as much as 4% and 18%, respectively. This paper describes the SHARP Streaming-Aggregation hardware architecture and a set of synthetic and application benchmarks used to study this new reduction capability, and the range of data sizes for which Streaming-Aggregation performs better than the low-latency aggregation algorithm.
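
For readers unfamiliar with hierarchical reduction, the semantics of an allreduce carried out over an aggregation tree can be sketched in a few lines of Python. This is only a conceptual model under stated assumptions, not the SHARP hardware design or its API: the function name `tree_allreduce`, the `radix` parameter, and the use of summation as the reduction operator are all hypothetical choices made for illustration.

```python
from typing import List


def tree_allreduce(host_vectors: List[List[float]], radix: int = 2) -> List[List[float]]:
    """Sum equal-length vectors up a radix-ary tree, then hand the result to every host.

    Hypothetical illustration only: each pass of the loop stands in for one
    level of aggregation nodes, each combining up to `radix` children.
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if not host_vectors:
        return []

    # Copy the inputs so the callers' vectors are left untouched.
    level = [list(v) for v in host_vectors]

    # Reduction phase: combine groups of up to `radix` vectors element-wise
    # until a single vector remains at the root of the tree.
    while len(level) > 1:
        level = [
            [sum(elems) for elems in zip(*level[i:i + radix])]
            for i in range(0, len(level), radix)
        ]
    root = level[0]

    # Distribution phase: every host receives its own copy of the result.
    return [list(root) for _ in host_vectors]


if __name__ == "__main__":
    print(tree_allreduce([[1, 2], [3, 4], [5, 6]]))
```

With three hosts and the default radix of 2, the first level combines the first two vectors, the second level combines that partial sum with the third vector, and all three hosts end up holding the same reduced vector.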