
Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT

Bibliographic Details
Main Authors: Cho, Ikhyun; Kang, U
Format: Online Article Text
Language: English
Published: Public Library of Science, 2022
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8856529/
https://www.ncbi.nlm.nih.gov/pubmed/35180258
http://dx.doi.org/10.1371/journal.pone.0263592
author Cho, Ikhyun
Kang, U
collection PubMed
description Knowledge Distillation (KD) is one of the widely known methods for model compression. In essence, KD trains a smaller student model based on a larger teacher model and tries to retain the teacher model’s level of performance as much as possible. However, existing KD methods suffer from the following limitations. First, since the student model is smaller in absolute size, it inherently lacks model capacity. Second, the absence of an initial guide for the student model makes it difficult for the student to imitate the teacher model to its fullest. Conventional KD methods yield low performance due to these limitations. In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher’s Predictions (PTP). This combination alleviates both limitations of KD. SPS is a new parameter-sharing method that increases the student model’s capacity. PTP is a KD-specialized initialization method that acts as a good initial guide for the student. Combined, they yield a significant increase in the student model’s performance. Experiments conducted on BERT with different datasets and tasks show that the proposed approach improves the student model’s performance by 4.4% on average on four GLUE tasks, outperforming existing KD baselines by significant margins.
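
The description above refers to the generic knowledge-distillation setup in which a smaller student is trained to match a larger teacher. For orientation only, the following is a minimal PyTorch-style sketch of that standard KD objective (a temperature-softened soft-label term plus a hard-label term); it is not the paper's Pea-KD implementation, and the hyperparameter names temperature and alpha are illustrative assumptions.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    # Soft-label term: KL divergence between the temperature-softened teacher
    # and student distributions, scaled by temperature**2 as is customary.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Weighted combination; alpha close to 1 leans on the teacher's soft labels.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

Here student_logits and teacher_logits are unnormalized class scores of shape (batch, num_classes) and labels are integer class ids. As the abstract notes, the SPS and PTP components of Pea-KD concern how the student's parameters are shared and how it is initialized, rather than this loss itself.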
format Online
Article
Text
id pubmed-8856529
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
published 2022-02-18
rights © 2022 Cho, Kang. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8856529/
https://www.ncbi.nlm.nih.gov/pubmed/35180258
http://dx.doi.org/10.1371/journal.pone.0263592