
SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression

Given a pre-trained BERT, how can we compress it into a fast and lightweight model while maintaining its accuracy? Pre-trained language models such as BERT are effective for improving the performance of natural language processing (NLP) tasks. However, heavy models like BERT suffer from large memory cost and long inference time. In this paper, we propose SensiMix (Sensitivity-Aware Mixed Precision Quantization), a novel quantization-based BERT compression method that considers the sensitivity of different modules of BERT. SensiMix applies 8-bit index quantization to the sensitive parts of BERT and 1-bit value quantization to the insensitive parts, maximizing the compression rate while minimizing the accuracy drop. We also propose three novel 1-bit training methods to minimize the accuracy drop: Absolute Binary Weight Regularization, Prioritized Training, and Inverse Layer-wise Fine-tuning. Moreover, for fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to the 8-bit and 1-bit quantized parts of the model, respectively. Experiments on four GLUE downstream tasks show that SensiMix compresses the original BERT model into an equally effective but lightweight one, reducing the model size by a factor of 8× and cutting inference time by around 80% without a noticeable accuracy drop.

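The abstract contrasts two quantizers: 8-bit index quantization for the sensitive parts of BERT and 1-bit value quantization for the insensitive parts. The sketch below is only one plausible reading of those terms, not the paper's exact SensiMix procedure: "8-bit index" is interpreted here as storing uint8 indices into a 256-entry codebook (a uniform codebook is an assumption), and "1-bit value" as keeping sign bits plus one per-tensor scale (the XNOR-Net-style mean-absolute-value scale is also an assumption). Function names are illustrative.

```python
# Hedged sketch of the two quantizers named in the abstract; not the paper's exact algorithm.
import numpy as np

def quantize_8bit_index(w: np.ndarray):
    """Replace each weight with a uint8 index into a 256-entry codebook.
    A uniform codebook between min and max is an assumption made for illustration."""
    lo, hi = float(w.min()), float(w.max())
    codebook = np.linspace(lo, hi, 256, dtype=np.float32)
    step = (hi - lo) / 255.0
    idx = np.clip(np.round((w - lo) / step), 0, 255).astype(np.uint8)
    return idx, codebook

def quantize_1bit_value(w: np.ndarray):
    """Keep only the sign of each weight plus one FP scale per tensor
    (mean absolute value, XNOR-Net style -- an assumption here)."""
    scale = float(np.abs(w).mean())
    signs = np.where(w < 0, -1, 1).astype(np.int8)
    return signs, scale

# Sensitive parts would keep 8 bits; insensitive parts would drop to 1 bit.
w = np.random.randn(768, 3072).astype(np.float32)
idx, codebook = quantize_8bit_index(w)
signs, scale = quantize_1bit_value(w)
print("mean |error|, 8-bit index:", np.abs(codebook[idx] - w).mean())
print("mean |error|, 1-bit value:", np.abs(signs * scale - w).mean())
```

The printed reconstruction errors make the trade-off concrete: the 1-bit quantizer is far lossier per weight, which is why the abstract reserves it for the parts of the model found to be insensitive.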

Bibliographic Details
Main Authors: Piao, Tairen, Cho, Ikhyun, Kang, U.
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9015158/
https://www.ncbi.nlm.nih.gov/pubmed/35436295
http://dx.doi.org/10.1371/journal.pone.0265621
_version_ 1784688329412837376
author Piao, Tairen
Cho, Ikhyun
Kang, U.
author_facet Piao, Tairen
Cho, Ikhyun
Kang, U.
author_sort Piao, Tairen
collection PubMed
description Given a pre-trained BERT, how can we compress it into a fast and lightweight model while maintaining its accuracy? Pre-trained language models such as BERT are effective for improving the performance of natural language processing (NLP) tasks. However, heavy models like BERT suffer from large memory cost and long inference time. In this paper, we propose SensiMix (Sensitivity-Aware Mixed Precision Quantization), a novel quantization-based BERT compression method that considers the sensitivity of different modules of BERT. SensiMix applies 8-bit index quantization to the sensitive parts of BERT and 1-bit value quantization to the insensitive parts, maximizing the compression rate while minimizing the accuracy drop. We also propose three novel 1-bit training methods to minimize the accuracy drop: Absolute Binary Weight Regularization, Prioritized Training, and Inverse Layer-wise Fine-tuning. Moreover, for fast inference, we apply FP16 general matrix multiplication (GEMM) and XNOR-Count GEMM to the 8-bit and 1-bit quantized parts of the model, respectively. Experiments on four GLUE downstream tasks show that SensiMix compresses the original BERT model into an equally effective but lightweight one, reducing the model size by a factor of 8× and cutting inference time by around 80% without a noticeable accuracy drop.
format Online
Article
Text
id pubmed-9015158
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9015158 2022-04-19 SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression Piao, Tairen; Cho, Ikhyun; Kang, U. PLoS One Research Article Public Library of Science 2022-04-18 /pmc/articles/PMC9015158/ /pubmed/35436295 http://dx.doi.org/10.1371/journal.pone.0265621 Text en © 2022 Piao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Piao, Tairen
Cho, Ikhyun
Kang, U.
SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression
title SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression
title_full SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression
title_fullStr SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression
title_full_unstemmed SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression
title_short SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression
title_sort sensimix: sensitivity-aware 8-bit index & 1-bit value mixed precision quantization for bert compression
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9015158/
https://www.ncbi.nlm.nih.gov/pubmed/35436295
http://dx.doi.org/10.1371/journal.pone.0265621
work_keys_str_mv AT piaotairen sensimixsensitivityaware8bitindex1bitvaluemixedprecisionquantizationforbertcompression
AT choikhyun sensimixsensitivityaware8bitindex1bitvaluemixedprecisionquantizationforbertcompression
AT kangu sensimixsensitivityaware8bitindex1bitvaluemixedprecisionquantizationforbertcompression
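The record above also mentions XNOR-Count GEMM for the 1-bit quantized parts. Below is a minimal NumPy sketch of the XNOR–popcount identity such kernels rely on: for vectors quantized to {-1, +1}, the dot product equals n − 2·popcount(xor of the packed sign bits). This is a CPU illustration of the arithmetic only, not the paper's optimized kernel; the function names are made up here.

```python
import numpy as np

def pack_signs(x: np.ndarray) -> np.ndarray:
    """Pack the sign bits of a float matrix into uint8 words (bit = 1 means negative)."""
    return np.packbits(x < 0, axis=-1)

def xnor_count_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute sign(a) @ sign(b).T using only XOR and bit counting."""
    n = a.shape[-1]                                        # true inner dimension
    pa, pb = pack_signs(a), pack_signs(b)                  # packed sign bits
    diff = np.bitwise_xor(pa[:, None, :], pb[None, :, :])  # positions where signs differ
    mismatches = np.unpackbits(diff, axis=-1, count=n).sum(axis=-1)
    return n - 2 * mismatches                              # matches minus mismatches

# Sanity check against an explicit {-1, +1} reference.
a, b = np.random.randn(4, 64), np.random.randn(3, 64)
ref = np.where(a < 0, -1, 1) @ np.where(b < 0, -1, 1).T
assert np.array_equal(xnor_count_matmul(a, b), ref)
```

On real hardware the XOR and bit count run on whole machine words or SIMD registers at once, which is what makes 1-bit GEMM attractive for inference; the 8-bit parts would instead go through ordinary FP16 GEMM, as the abstract states.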