Domain Word Extension Using Curriculum Learning

Self-supervised learning models such as BERT have improved the performance of various natural language processing tasks. However, their effectiveness degrades on out-of-domain text, and training a new language model for a specific domain is difficult because it is time-consuming and requires large amounts of data. We propose a method to quickly and effectively adapt pre-trained general-domain language models to a specific domain's vocabulary without re-training. An extended vocabulary list is obtained by extracting meaningful wordpieces from the training data of the downstream task. We introduce curriculum learning, training the model with two successive updates, to adapt the embedding values of the new vocabulary. The method is convenient to apply because all training for the downstream task is performed in one run. To confirm its effectiveness, we conducted experiments on AIDA-SC, AIDA-FC, and KLUE-TC, which are Korean classification tasks, and achieved stable performance improvements.
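The abstract sketches a two-part recipe: mine meaningful wordpieces from the downstream training data to extend the vocabulary, then run a two-stage curriculum so the embeddings of the new tokens adapt before the whole model is fine-tuned. Below is a minimal sketch of one plausible reading of that recipe using the Hugging Face transformers API; the checkpoint name, the hand-written wordpiece list, the learning rates, and the run_epoch helper are illustrative assumptions, not the authors' published code.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "klue/bert-base"  # assumption: any Korean BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=7)  # set to the downstream task's class count

# Step 1: extend the vocabulary with domain wordpieces mined from the
# downstream training data (the mining itself is left abstract here).
domain_wordpieces = ["가스센서", "##전극"]  # placeholder terms, not mined
tokenizer.add_tokens(domain_wordpieces)
model.resize_token_embeddings(len(tokenizer))  # new rows start random

def run_epoch(model, optimizer, batches):
    # One pass over the downstream task data with the usual
    # classification loss; `batches` is any iterable of tokenized inputs.
    model.train()
    for batch in batches:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage A of the curriculum: freeze the encoder and update only the
# input-embedding matrix so the new rows settle near sensible values.
for p in model.parameters():
    p.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
opt_a = torch.optim.AdamW([model.get_input_embeddings().weight], lr=5e-4)
# run_epoch(model, opt_a, train_batches)  # train_batches: your DataLoader

# Stage B: unfreeze everything and fine-tune normally. Both stages run
# back to back, so the whole procedure is still a single training job.
for p in model.parameters():
    p.requires_grad = True
opt_b = torch.optim.AdamW(model.parameters(), lr=2e-5)
# run_epoch(model, opt_b, train_batches)

Reading the abstract's "two successive updates" as these two back-to-back stages keeps the randomly initialized embedding rows from perturbing the pre-trained encoder before they have settled.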

Bibliographic Details
Main Authors: Seong, Sujin; Cha, Jeongwon
Format: Online Article (Text)
Language: English
Published: MDPI, 2023-03-13
Journal: Sensors (Basel)
Collection: PubMed (PMC10056774)
Subjects: Article
Rights: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. Open access under the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10056774/
https://www.ncbi.nlm.nih.gov/pubmed/36991775
http://dx.doi.org/10.3390/s23063064