Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT
Knowledge Distillation (KD) is one of the widely known methods for model compression. In essence, KD trains a smaller student model based on a larger teacher model and tries to retain the teacher model’s level of performance as much as possible. However, existing KD methods suffer from the following...
Main Authors: | Cho, Ikhyun; Kang, U |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8856529/ https://www.ncbi.nlm.nih.gov/pubmed/35180258 http://dx.doi.org/10.1371/journal.pone.0263592 |
_version_ | 1784653867443552256 |
---|---|
author | Cho, Ikhyun Kang, U |
author_facet | Cho, Ikhyun Kang, U |
author_sort | Cho, Ikhyun |
collection | PubMed |
description | Knowledge Distillation (KD) is one of the most widely known methods for model compression. In essence, KD trains a smaller student model to mimic a larger teacher model, aiming to retain as much of the teacher's performance as possible. However, existing KD methods suffer from the following limitations. First, because the student model is smaller in absolute size, it inherently lacks model capacity. Second, the absence of an initial guide for the student makes it difficult for the student to imitate the teacher model to its fullest. Conventional KD methods yield low performance due to these limitations. In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher’s Predictions (PTP), which together alleviate KD's limitations. SPS is a new parameter-sharing method that increases the student model's capacity. PTP is a KD-specialized initialization method that acts as a good initial guide for the student. Combined, they yield a significant increase in the student model's performance. Experiments conducted on BERT with different datasets and tasks show that the proposed approach improves the student model's performance by 4.4% on average across four GLUE tasks, outperforming existing KD baselines by significant margins. (An illustrative sketch of the generic distillation loss underlying such methods appears at the end of this record.) |
format | Online Article Text |
id | pubmed-8856529 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-8856529 2022-02-19 Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT Cho, Ikhyun Kang, U PLoS One Research Article Knowledge Distillation (KD) is one of the widely known methods for model compression. In essence, KD trains a smaller student model based on a larger teacher model and tries to retain the teacher model’s level of performance as much as possible. However, existing KD methods suffer from the following limitations. First, since the student model is smaller in absolute size, it inherently lacks model capacity. Second, the absence of an initial guide for the student model makes it difficult for the student to imitate the teacher model to its fullest. Conventional KD methods yield low performance due to these limitations. In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher’s Predictions (PTP). Using this combination, we are capable of alleviating the KD’s limitations. SPS is a new parameter sharing method that increases the student model capacity. PTP is a KD-specialized initialization method, which can act as a good initial guide for the student. When combined, this method yields a significant increase in student model’s performance. Experiments conducted on BERT with different datasets and tasks show that the proposed approach improves the student model’s performance by 4.4% on average in four GLUE tasks, outperforming existing KD baselines by significant margins. Public Library of Science 2022-02-18 /pmc/articles/PMC8856529/ /pubmed/35180258 http://dx.doi.org/10.1371/journal.pone.0263592 Text en © 2022 Cho, Kang https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Cho, Ikhyun Kang, U Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT |
title | Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT |
title_full | Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT |
title_fullStr | Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT |
title_full_unstemmed | Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT |
title_short | Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT |
title_sort | pea-kd: parameter-efficient and accurate knowledge distillation on bert |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8856529/ https://www.ncbi.nlm.nih.gov/pubmed/35180258 http://dx.doi.org/10.1371/journal.pone.0263592 |
work_keys_str_mv | AT choikhyun peakdparameterefficientandaccurateknowledgedistillationonbert AT kangu peakdparameterefficientandaccurateknowledgedistillationonbert |
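For readers unfamiliar with the distillation objective the abstract refers to, the sketch below shows a standard soft-label knowledge-distillation loss of the kind Pea-KD builds on. It is a minimal illustration under common assumptions (a temperature-scaled KL term blended with hard-label cross-entropy; the `temperature` and `alpha` values are arbitrary), not the authors' implementation. In particular, the SPS parameter-sharing layout and the PTP pretraining labels described in the abstract are not reproduced here.

```python
# Minimal sketch of a standard soft-label KD loss (Hinton-style), for context only.
# The temperature and alpha values are illustrative assumptions, not values from the paper.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    """Blend teacher-guided soft targets with ground-truth cross-entropy."""
    # Soft targets: match the student's tempered distribution to the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # rescale so soft and hard gradients have comparable magnitude
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    torch.manual_seed(0)
    student_logits = torch.randn(8, 2)   # e.g., logits for a binary GLUE task
    teacher_logits = torch.randn(8, 2)   # would come from the frozen BERT teacher
    labels = torch.randint(0, 2, (8,))
    print(kd_loss(student_logits, teacher_logits, labels))
```

In Pea-KD, an objective of this kind is applied to a student whose capacity is increased by SPS and whose weights are initialized via PTP; those components are specific to the paper and require its full text or released code rather than this abstract.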