
DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions


Bibliographic Details
Main Authors: Mukunoki, Daichi; Ozaki, Katsuhisa; Ogita, Takeshi; Imamura, Toshiyuki
Format: Online Article Text
Language: English
Published: 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295351/
http://dx.doi.org/10.1007/978-3-030-50743-5_12
collection PubMed
description This paper proposes a method for implementing dense matrix multiplication in FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 precision and return the result in FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation. The method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, up to and including the correctly rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance depends on the absolute-value range of the elements of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, the DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), whereas cublasDgemm achieves only 539 GFlops on the FP64 floating-point units. The results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.
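The error-free splitting idea mentioned in the description can be illustrated with a small sketch. This is not the authors' implementation: it works on a dot product (a 1 × k by k × 1 product) in pure Python, uses FP64 arithmetic as a stand-in for the FP16-input/FP32-accumulate Tensor Core pipeline, and uses `math.fsum` as a stand-in for whatever accurate summation the paper applies. The names `split` and `ozaki_dot` are hypothetical, and the sketch assumes no overflow or underflow occurs.

```python
import math
from fractions import Fraction


def split(v, beta):
    """Split a list of floats into slices (hypothetical helper).

    Every entry of a slice is an integer of magnitude <= 2**beta times one
    power of two shared by the whole slice, and the slices add up exactly
    to the input.
    """
    slices = []
    rest = list(v)
    while any(r != 0.0 for r in rest):
        # frexp returns a mantissa in [0.5, 1), so max|rest| < 2**e.
        _, e = math.frexp(max(abs(r) for r in rest))
        shift = e - beta  # the shared unit of this slice is 2**shift
        # Scaling by a power of two is exact; round() keeps the top bits.
        q = [round(math.ldexp(r, -shift)) for r in rest]
        s = [math.ldexp(float(qi), shift) for qi in q]
        # The residual is exactly representable, so this subtraction is exact.
        rest = [r - si for r, si in zip(rest, s)]
        slices.append(s)
    return slices


def ozaki_dot(x, y):
    """Dot product via slice products that incur no rounding error."""
    k = len(x)
    # Pick beta so that 2*beta + ceil(log2(k)) <= 53: a sum of k products
    # of beta-bit integers then fits in the 53-bit FP64 significand.
    beta = (53 - (k - 1).bit_length()) // 2
    x_slices = split(x, beta)
    y_slices = split(y, beta)
    partials = []
    for xs in x_slices:
        for ys in y_slices:
            acc = 0.0
            for a, b in zip(xs, ys):
                acc += a * b  # exact by the choice of beta
            partials.append(acc)
    # Stand-in for an accurate final summation: fsum returns the correctly
    # rounded sum of the (exact) partial results.
    return math.fsum(partials)


if __name__ == "__main__":
    x = [1e9, 3.14159, -2.5e-7, 1.0 / 3.0]
    y = [1.0 / 7.0, -1e8, 6.02e5, 2.0 ** -20]
    exact = float(sum(Fraction(a) * Fraction(b) for a, b in zip(x, y)))
    print(ozaki_dot(x, y), exact)
```

Because every slice product is computed without rounding, the partial results do not depend on the order of accumulation, which is the property behind the reproducibility claim. In this sketch, the only rounding happens once, in the final summation. A wider dynamic range in the inputs produces more slices and therefore more slice products, which is consistent with the description's remark that performance depends on the absolute-value range of the input elements.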
id pubmed-7295351
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-7295351 2020-06-16 High Performance Computing Article 2020-05-22 /pmc/articles/PMC7295351/ http://dx.doi.org/10.1007/978-3-030-50743-5_12 Text en © Springer Nature Switzerland AG 2020. This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.