Memory-Replay Knowledge Distillation

Bibliographic Details
Main Authors: Wang, Jiyue, Zhang, Pei, Li, Yanxiong
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8071405/
https://www.ncbi.nlm.nih.gov/pubmed/33921068
http://dx.doi.org/10.3390/s21082792
_version_ 1783683692276744192
author Wang, Jiyue
Zhang, Pei
Li, Yanxiong
author_facet Wang, Jiyue
Zhang, Pei
Li, Yanxiong
author_sort Wang, Jiyue
collection PubMed
description Knowledge Distillation (KD), which transfers knowledge from a teacher to a student network by penalizing their Kullback–Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, whereas self-KD distills the model’s own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or by the same sample under different augmentations. However, both of these self-KD approaches have limitations that hinder widespread use: the former requires redesigning the DNN for each task, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), which uses historical models as teachers. First, we propose a novel self-KD training method that penalizes the KD loss between the current model’s output distributions and the outputs of its backup copies along the training trajectory. This strategy regularizes the model with its historical output distribution space and stabilizes learning. Second, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers’ outputs for better guidance. Finally, to ensure that the teacher outputs present the ground-truth class as the correct one, we correct the teacher logits with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing, more elaborate self-KD methods that rely on various forms of external knowledge, the effectiveness of MrKD sheds light on the historical models that are usually discarded along the training trajectory.
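
The training scheme described above can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch rendering of the core MrKD objective under stated assumptions: a cross-entropy term on the labels plus a temperature-scaled KL term that pulls the current model toward the output distribution of a frozen backup taken earlier on its own training trajectory, with a simple logit-swap reading of Knowledge Adjustment applied to the teacher. The names, hyperparameters (alpha, T, replay_every), and the KA implementation are illustrative only, not the authors' code; the FCN ensemble over several historical checkpoints is omitted for brevity.

import copy
import torch
import torch.nn.functional as F

def knowledge_adjustment(teacher_logits, labels):
    # Illustrative reading of KA: whenever the teacher's top prediction is wrong,
    # swap the predicted-class logit with the ground-truth logit so that the
    # true class always carries the highest score in the teacher distribution.
    adjusted = teacher_logits.clone()
    pred = adjusted.argmax(dim=1)
    rows = torch.where(pred != labels)[0]
    true_vals = adjusted[rows, labels[rows]].clone()
    adjusted[rows, labels[rows]] = adjusted[rows, pred[rows]]
    adjusted[rows, pred[rows]] = true_vals
    return adjusted

def mrkd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    # Standard KD objective: (1 - alpha) * CE + alpha * T^2 * KL(student || teacher),
    # where the "teacher" is a historical copy of the student itself.
    ce = F.cross_entropy(student_logits, labels)
    teacher_logits = knowledge_adjustment(teacher_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - alpha) * ce + alpha * kl

def train(model, loader, optimizer, epochs=1, replay_every=1000):
    # Keep a frozen backup from an earlier point on the training trajectory and
    # refresh it periodically; the backup lags the live model by up to
    # replay_every steps and serves as the memory-replay teacher.
    backup = copy.deepcopy(model).eval()
    step = 0
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                teacher_logits = backup(x)
            loss = mrkd_loss(model(x), teacher_logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % replay_every == 0:
                backup = copy.deepcopy(model).eval()

In this sketch the only teacher state kept around is a periodically refreshed snapshot of the model itself, so the regularization comes from the training trajectory rather than from an external pre-trained network or extra augmentation machinery.
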
format Online
Article
Text
id pubmed-8071405
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8071405 2021-04-26 Memory-Replay Knowledge Distillation Wang, Jiyue Zhang, Pei Li, Yanxiong Sensors (Basel) Article Knowledge Distillation (KD), which transfers knowledge from a teacher to a student network by penalizing their Kullback–Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, whereas self-KD distills the model’s own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or by the same sample under different augmentations. However, both of these self-KD approaches have limitations that hinder widespread use: the former requires redesigning the DNN for each task, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), which uses historical models as teachers. First, we propose a novel self-KD training method that penalizes the KD loss between the current model’s output distributions and the outputs of its backup copies along the training trajectory. This strategy regularizes the model with its historical output distribution space and stabilizes learning. Second, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers’ outputs for better guidance. Finally, to ensure that the teacher outputs present the ground-truth class as the correct one, we correct the teacher logits with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing, more elaborate self-KD methods that rely on various forms of external knowledge, the effectiveness of MrKD sheds light on the historical models that are usually discarded along the training trajectory. MDPI 2021-04-15 /pmc/articles/PMC8071405/ /pubmed/33921068 http://dx.doi.org/10.3390/s21082792 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Wang, Jiyue
Zhang, Pei
Li, Yanxiong
Memory-Replay Knowledge Distillation
title Memory-Replay Knowledge Distillation
title_full Memory-Replay Knowledge Distillation
title_fullStr Memory-Replay Knowledge Distillation
title_full_unstemmed Memory-Replay Knowledge Distillation
title_short Memory-Replay Knowledge Distillation
title_sort memory-replay knowledge distillation
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8071405/
https://www.ncbi.nlm.nih.gov/pubmed/33921068
http://dx.doi.org/10.3390/s21082792
work_keys_str_mv AT wangjiyue memoryreplayknowledgedistillation
AT zhangpei memoryreplayknowledgedistillation
AT liyanxiong memoryreplayknowledgedistillation