
Sequence-to-sequence pretraining for a less-resourced Slovenian language


Bibliographic Details
Main Authors:	Ulčar, Matej; Robnik-Šikonja, Marko
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2023
Subjects:	Artificial Intelligence
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10086348/ https://www.ncbi.nlm.nih.gov/pubmed/37056912 http://dx.doi.org/10.3389/frai.2023.932519

_version_	1785022132286128128
author	Ulčar, Matej Robnik-Šikonja, Marko
author_facet	Ulčar, Matej Robnik-Šikonja, Marko
author_sort	Ulčar, Matej
collection	PubMed
description	INTRODUCTION: Large pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modeling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which more naturally fits text generation tasks. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. METHODS: We trained two different-sized T5-type sequence-to-sequence models for morphologically rich Slovene language with much fewer resources. We analyzed the behavior of new models on 11 tasks, eight classification ones (named entity recognition, sentiment classification, lemmatization, two question answering tasks, two natural language inference tasks, and a coreference resolution task), and three text generation tasks (text simplification and two summarization tasks on different datasets). We compared the new SloT5 models with the multilingual mT5 model, multilingual mBART-50 model, and with four encoder BERT-like models: multilingual BERT, multilingual XLM-RoBERTa, trilingual Croatian-Slovene-English BERT, and monolingual Slovene RoBERTa model. RESULTS: Concerning the classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model. However, these models are helpful for generative tasks and provide several useful results. In general, the size of models matters, and currently, there is not enough training data for Slovene for successful pretraining of large models. DISCUSSION: While the results are obtained on Slovene, we believe that they may generalize to other less-resourced languages, where such models will be built. We make the training and evaluation code, as well as the trained models, publicly available.
format	Online Article Text
id	pubmed-10086348
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-10086348 2023-04-12 Sequence-to-sequence pretraining for a less-resourced Slovenian language Ulčar, Matej Robnik-Šikonja, Marko Front Artif Intell Artificial Intelligence INTRODUCTION: Large pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modeling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which more naturally fits text generation tasks. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. METHODS: We trained two different-sized T5-type sequence-to-sequence models for morphologically rich Slovene language with much fewer resources. We analyzed the behavior of new models on 11 tasks, eight classification ones (named entity recognition, sentiment classification, lemmatization, two question answering tasks, two natural language inference tasks, and a coreference resolution task), and three text generation tasks (text simplification and two summarization tasks on different datasets). We compared the new SloT5 models with the multilingual mT5 model, multilingual mBART-50 model, and with four encoder BERT-like models: multilingual BERT, multilingual XLM-RoBERTa, trilingual Croatian-Slovene-English BERT, and monolingual Slovene RoBERTa model. RESULTS: Concerning the classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model. However, these models are helpful for generative tasks and provide several useful results. In general, the size of models matters, and currently, there is not enough training data for Slovene for successful pretraining of large models. DISCUSSION: While the results are obtained on Slovene, we believe that they may generalize to other less-resourced languages, where such models will be built. We make the training and evaluation code, as well as the trained models, publicly available. Frontiers Media S.A. 2023-03-28 /pmc/articles/PMC10086348/ /pubmed/37056912 http://dx.doi.org/10.3389/frai.2023.932519 Text en Copyright © 2023 Ulčar and Robnik-Šikonja. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle	Artificial Intelligence Ulčar, Matej Robnik-Šikonja, Marko Sequence-to-sequence pretraining for a less-resourced Slovenian language
title	Sequence-to-sequence pretraining for a less-resourced Slovenian language
title_sort	sequence-to-sequence pretraining for a less-resourced slovenian language
topic	Artificial Intelligence
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10086348/ https://www.ncbi.nlm.nih.gov/pubmed/37056912 http://dx.doi.org/10.3389/frai.2023.932519
work_keys_str_mv	AT ulcarmatej sequencetosequencepretrainingforalessresourcedslovenianlanguage AT robniksikonjamarko sequencetosequencepretrainingforalessresourcedslovenianlanguage
