
DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions

This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4×4 matrix multiplications on FP16 inputs with FP32 preci...


Bibliographic Details
Main Authors:	Mukunoki, Daichi; Ozaki, Katsuhisa; Ogita, Takeshi; Imamura, Toshiyuki
Format:	Online Article Text
Language:	English
Published:	2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295351/ http://dx.doi.org/10.1007/978-3-030-50743-5_12

_version_	1783546634316021760
author	Mukunoki, Daichi Ozaki, Katsuhisa Ogita, Takeshi Imamura, Toshiyuki
collection	PubMed
description	This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4×4 matrix multiplications on FP16 inputs with FP32 precision, and return the result on FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, including the correctly-rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of each element of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), although cublasDgemm can achieve only 539 GFlops on FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.
format	Online Article Text
id	pubmed-7295351
institution	National Center for Biotechnology Information
language	English
publishDate	2020
record_format	MEDLINE/PubMed
spelling	pubmed-7295351 2020-06-16 DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions Mukunoki, Daichi Ozaki, Katsuhisa Ogita, Takeshi Imamura, Toshiyuki High Performance Computing Article 2020-05-22 /pmc/articles/PMC7295351/ http://dx.doi.org/10.1007/978-3-030-50743-5_12 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7295351/ http://dx.doi.org/10.1007/978-3-030-50743-5_12
