
Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT

Bibliographic Details
Main Authors: Cho, Ikhyun; Kang, U
Format: Online Article Text
Language: English
Published: Public Library of Science, 2022
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8856529/
https://www.ncbi.nlm.nih.gov/pubmed/35180258
http://dx.doi.org/10.1371/journal.pone.0263592
author Cho, Ikhyun
Kang, U
collection PubMed
description Knowledge Distillation (KD) is one of the widely known methods for model compression. In essence, KD trains a smaller student model based on a larger teacher model and tries to retain the teacher model’s level of performance as much as possible. However, existing KD methods suffer from the following limitations. First, since the student model is smaller in absolute size, it inherently lacks model capacity. Second, the absence of an initial guide for the student model makes it difficult for the student to imitate the teacher model to its fullest. Conventional KD methods yield low performance due to these limitations. In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher’s Predictions (PTP). This combination alleviates both limitations of KD. SPS is a new parameter-sharing method that increases the student model’s capacity. PTP is a KD-specialized initialization method that acts as a good initial guide for the student. Combined, they yield a significant increase in the student model’s performance. Experiments conducted on BERT with different datasets and tasks show that the proposed approach improves the student model’s performance by 4.4% on average on four GLUE tasks, outperforming existing KD baselines by significant margins.
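
The description above refers to the generic knowledge-distillation setup in which a smaller student is trained to match a larger teacher. For orientation only, the following is a minimal PyTorch-style sketch of that standard KD objective (a temperature-softened soft-label term plus a hard-label term); it is not the paper's Pea-KD implementation, and the hyperparameter names temperature and alpha are illustrative assumptions.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    # Soft-label term: KL divergence between the temperature-softened teacher
    # and student distributions, scaled by temperature**2 as is customary.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Weighted combination; alpha close to 1 leans on the teacher's soft labels.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

Here student_logits and teacher_logits are unnormalized class scores of shape (batch, num_classes) and labels are integer class ids. As the abstract notes, the SPS and PTP components of Pea-KD concern how the student's parameters are shared and how it is initialized, rather than this loss itself.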
format Online
Article
Text
id pubmed-8856529
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
published 2022-02-18
rights © 2022 Cho, Kang. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8856529/
https://www.ncbi.nlm.nih.gov/pubmed/35180258
http://dx.doi.org/10.1371/journal.pone.0263592