
Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruni...


Bibliographic Details
Main Authors:	Hawks, Benjamin; Duarte, Javier; Fraser, Nicholas J.; Pappalardo, Alessandro; Tran, Nhan; Umuroglu, Yaman
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Artificial Intelligence
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8299073/ https://www.ncbi.nlm.nih.gov/pubmed/34308339 http://dx.doi.org/10.3389/frai.2021.676564

_version_	1783726191345139712
author	Hawks, Benjamin; Duarte, Javier; Fraser, Nicholas J.; Pappalardo, Alessandro; Tran, Nhan; Umuroglu, Yaman
author_facet	Hawks, Benjamin; Duarte, Javier; Fraser, Nicholas J.; Pappalardo, Alessandro; Tran, Nhan; Umuroglu, Yaman
author_sort	Hawks, Benjamin
collection	PubMed
description	Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similar to or better in terms of computational efficiency compared to other neural architecture search techniques like Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
format	Online Article Text
id	pubmed-8299073
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-82990732021-07-24 Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference Hawks, Benjamin Duarte, Javier Fraser, Nicholas J. Pappalardo, Alessandro Tran, Nhan Umuroglu, Yaman Front Artif Intell Artificial Intelligence Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similar to or better in terms of computational efficiency compared to other neural architecture search techniques like Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability. Frontiers Media S.A. 2021-07-09 /pmc/articles/PMC8299073/ /pubmed/34308339 http://dx.doi.org/10.3389/frai.2021.676564 Text en Copyright © 2021 Hawks, Duarte, Fraser, Pappalardo, Tran and Umuroglu. 
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle	Artificial Intelligence Hawks, Benjamin Duarte, Javier Fraser, Nicholas J. Pappalardo, Alessandro Tran, Nhan Umuroglu, Yaman Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference
title	Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference
title_full	Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference
title_fullStr	Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference
title_full_unstemmed	Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference
title_short	Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference
title_sort	ps and qs: quantization-aware pruning for efficient low latency neural network inference
topic	Artificial Intelligence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8299073/ https://www.ncbi.nlm.nih.gov/pubmed/34308339 http://dx.doi.org/10.3389/frai.2021.676564
work_keys_str_mv	AT hawksbenjamin psandqsquantizationawarepruningforefficientlowlatencyneuralnetworkinference AT duartejavier psandqsquantizationawarepruningforefficientlowlatencyneuralnetworkinference AT frasernicholasj psandqsquantizationawarepruningforefficientlowlatencyneuralnetworkinference AT pappalardoalessandro psandqsquantizationawarepruningforefficientlowlatencyneuralnetworkinference AT trannhan psandqsquantizationawarepruningforefficientlowlatencyneuralnetworkinference AT umurogluyaman psandqsquantizationawarepruningforefficientlowlatencyneuralnetworkinference

