DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions
This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 preci...
Main authors: | Mukunoki, Daichi; Ozaki, Katsuhisa; Ogita, Takeshi; Imamura, Toshiyuki |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295351/ http://dx.doi.org/10.1007/978-3-030-50743-5_12 |
_version_ | 1783546634316021760 |
---|---|
author | Mukunoki, Daichi Ozaki, Katsuhisa Ogita, Takeshi Imamura, Toshiyuki |
author_facet | Mukunoki, Daichi Ozaki, Katsuhisa Ogita, Takeshi Imamura, Toshiyuki |
author_sort | Mukunoki, Daichi |
collection | PubMed |
description | This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 precision, and return the result on FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, including the correctly-rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of each element of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), although cublasDgemm can achieve only 539 GFlops on FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads. |
format | Online Article Text |
id | pubmed-7295351 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-72953512020-06-16 DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions Mukunoki, Daichi Ozaki, Katsuhisa Ogita, Takeshi Imamura, Toshiyuki High Performance Computing Article This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4 × 4 matrix multiplications on FP16 inputs with FP32 precision, and return the result on FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, including the correctly-rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of each element of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), although cublasDgemm can achieve only 539 GFlops on FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads. 2020-05-22 /pmc/articles/PMC7295351/ http://dx.doi.org/10.1007/978-3-030-50743-5_12 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. 
These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Mukunoki, Daichi Ozaki, Katsuhisa Ogita, Takeshi Imamura, Toshiyuki DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions |
title | DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions |
title_full | DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions |
title_fullStr | DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions |
title_full_unstemmed | DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions |
title_short | DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions |
title_sort | dgemm using tensor cores, and its accurate and reproducible versions |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295351/ http://dx.doi.org/10.1007/978-3-030-50743-5_12 |
work_keys_str_mv | AT mukunokidaichi dgemmusingtensorcoresanditsaccurateandreproducibleversions AT ozakikatsuhisa dgemmusingtensorcoresanditsaccurateandreproducibleversions AT ogitatakeshi dgemmusingtensorcoresanditsaccurateandreproducibleversions AT imamuratoshiyuki dgemmusingtensorcoresanditsaccurateandreproducibleversions |
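
The abstract describes an error-free splitting approach (the Ozaki scheme) in which each input matrix is cut into low-precision slices whose products can be computed without rounding error. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation. It assumes FP64 accumulation on the CPU instead of FP16 inputs with FP32 accumulation on Tensor Cores, and it assumes inputs are finite and far from overflow and underflow. The names `split_rows` and `ozaki_style_matmul` and the slice-width rule for `beta` are hypothetical choices made for this illustration.

```python
import math
from fractions import Fraction


def split_rows(M, beta):
    """Split each row of M into slices of at most `beta` bits.

    Returns (slices, exps) with
    M[i][j] == sum_s slices[s][i][j] * 2**(exps[i] - (s + 1) * beta),
    where every slice entry is an integer-valued float below 2**beta.
    """
    # Per-row exponent e such that max|row| < 2**e.
    exps = [math.frexp(max(abs(v) for v in row))[1] for row in M]
    # Scaling by a power of two is exact (assuming no underflow).
    rem = [[math.ldexp(v, -e) for v in row] for row, e in zip(M, exps)]
    slices, s = [], 1
    while any(v != 0.0 for row in rem for v in row):
        # Peel off the next `beta` bits of every remainder.
        cur = [[float(math.trunc(math.ldexp(v, s * beta))) for v in row]
               for row in rem]
        # The subtraction is exact: it only clears leading bits.
        rem = [[v - math.ldexp(c, -s * beta) for v, c in zip(r_row, c_row)]
               for r_row, c_row in zip(rem, cur)]
        slices.append(cur)
        s += 1
    return slices, exps


def ozaki_style_matmul(A, B):
    """Multiply A (m x k) by B (k x n) via error-free slice products."""
    m, k, n = len(A), len(B), len(B[0])
    # Assumed rule: k products of two beta-bit integers must stay below
    # 2**53 so that FP64 accumulation of each slice product is exact.
    beta = (53 - (k - 1).bit_length()) // 2
    SA, ea = split_rows(A, beta)
    # Split B column-wise by splitting the rows of its transpose.
    SB, eb = split_rows([list(col) for col in zip(*B)], beta)
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            terms = []
            for s, As in enumerate(SA, 1):
                for t, Bs in enumerate(SB, 1):
                    p = 0.0
                    for a, b in zip(As[i], Bs[j]):
                        p += a * b  # exact: integer values below 2**53
                    # Undo the scaling of both slices (exact power of two).
                    terms.append(math.ldexp(p, ea[i] + eb[j] - (s + t) * beta))
            # Only this final summation rounds; fsum rounds once.
            C[i][j] = math.fsum(terms)
    return C
```

Because every slice product is exact and rounding happens only in the final summation, the result in this sketch does not depend on the order in which the inner dimension is traversed, which is the property behind the reproducibility claim in the abstract. The number of slices grows with the spread of magnitudes within a row or column, which mirrors the abstract's remark that performance depends on the absolute-value range of the inputs. In the method described by the record, each slice product would instead be dispatched to the cublasGemmEx routine, with a narrower slice width dictated by the FP16 and FP32 formats.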