Memory-Replay Knowledge Distillation

Bibliographic Details
Main Authors: Wang, Jiyue, Zhang, Pei, Li, Yanxiong
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8071405/
https://www.ncbi.nlm.nih.gov/pubmed/33921068
http://dx.doi.org/10.3390/s21082792
_version_ 1783683692276744192
author Wang, Jiyue
Zhang, Pei
Li, Yanxiong
author_facet Wang, Jiyue
Zhang, Pei
Li, Yanxiong
author_sort Wang, Jiyue
collection PubMed
description Knowledge Distillation (KD), which transfers knowledge from a teacher to a student network by penalizing their Kullback–Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, whereas self-KD distills the model’s own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or by the same sample under different augmentations. However, both of these self-KD approaches have limitations that hinder widespread use: the former requires redesigning the DNN for each task, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), which uses historical models as teachers. First, we propose a novel self-KD training method that penalizes the KD loss between the current model’s output distributions and the outputs of its backup copies along the training trajectory. This strategy regularizes the model with its historical output distribution space and stabilizes learning. Second, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers’ outputs for better guidance. Finally, to ensure that the teacher outputs present the ground-truth class as the correct one, we correct the teacher logits with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing, more elaborate self-KD methods that rely on various forms of external knowledge, the effectiveness of MrKD sheds light on the historical models that are usually discarded along the training trajectory.
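
The training scheme described above can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch rendering of the core MrKD objective under stated assumptions: a cross-entropy term on the labels plus a temperature-scaled KL term that pulls the current model toward the output distribution of a frozen backup taken earlier on its own training trajectory, with a simple logit-swap reading of Knowledge Adjustment applied to the teacher. The names, hyperparameters (alpha, T, replay_every), and the KA implementation are illustrative only, not the authors' code; the FCN ensemble over several historical checkpoints is omitted for brevity.

import copy
import torch
import torch.nn.functional as F

def knowledge_adjustment(teacher_logits, labels):
    # Illustrative reading of KA: whenever the teacher's top prediction is wrong,
    # swap the predicted-class logit with the ground-truth logit so that the
    # true class always carries the highest score in the teacher distribution.
    adjusted = teacher_logits.clone()
    pred = adjusted.argmax(dim=1)
    rows = torch.where(pred != labels)[0]
    true_vals = adjusted[rows, labels[rows]].clone()
    adjusted[rows, labels[rows]] = adjusted[rows, pred[rows]]
    adjusted[rows, pred[rows]] = true_vals
    return adjusted

def mrkd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    # Standard KD objective: (1 - alpha) * CE + alpha * T^2 * KL(student || teacher),
    # where the "teacher" is a historical copy of the student itself.
    ce = F.cross_entropy(student_logits, labels)
    teacher_logits = knowledge_adjustment(teacher_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - alpha) * ce + alpha * kl

def train(model, loader, optimizer, epochs=1, replay_every=1000):
    # Keep a frozen backup from an earlier point on the training trajectory and
    # refresh it periodically; the backup lags the live model by up to
    # replay_every steps and serves as the memory-replay teacher.
    backup = copy.deepcopy(model).eval()
    step = 0
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                teacher_logits = backup(x)
            loss = mrkd_loss(model(x), teacher_logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % replay_every == 0:
                backup = copy.deepcopy(model).eval()

In this sketch the only teacher state kept around is a periodically refreshed snapshot of the model itself, so the regularization comes from the training trajectory rather than from an external pre-trained network or extra augmentation machinery.
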
format Online
Article
Text
id pubmed-8071405
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8071405 2021-04-26 Memory-Replay Knowledge Distillation Wang, Jiyue Zhang, Pei Li, Yanxiong Sensors (Basel) Article Knowledge Distillation (KD), which transfers knowledge from a teacher to a student network by penalizing their Kullback–Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, whereas self-KD distills the model’s own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or by the same sample under different augmentations. However, both of these self-KD approaches have limitations that hinder widespread use: the former requires redesigning the DNN for each task, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), which uses historical models as teachers. First, we propose a novel self-KD training method that penalizes the KD loss between the current model’s output distributions and the outputs of its backup copies along the training trajectory. This strategy regularizes the model with its historical output distribution space and stabilizes learning. Second, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers’ outputs for better guidance. Finally, to ensure that the teacher outputs present the ground-truth class as the correct one, we correct the teacher logits with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing, more elaborate self-KD methods that rely on various forms of external knowledge, the effectiveness of MrKD sheds light on the historical models that are usually discarded along the training trajectory. MDPI 2021-04-15 /pmc/articles/PMC8071405/ /pubmed/33921068 http://dx.doi.org/10.3390/s21082792 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Wang, Jiyue
Zhang, Pei
Li, Yanxiong
Memory-Replay Knowledge Distillation
title Memory-Replay Knowledge Distillation
title_full Memory-Replay Knowledge Distillation
title_fullStr Memory-Replay Knowledge Distillation
title_full_unstemmed Memory-Replay Knowledge Distillation
title_short Memory-Replay Knowledge Distillation
title_sort memory-replay knowledge distillation
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8071405/
https://www.ncbi.nlm.nih.gov/pubmed/33921068
http://dx.doi.org/10.3390/s21082792
work_keys_str_mv AT wangjiyue memoryreplayknowledgedistillation
AT zhangpei memoryreplayknowledgedistillation
AT liyanxiong memoryreplayknowledgedistillation