
Building a Vietnamese Dataset for Natural Language Inference Models

Natural language inference models are essential resources for many natural language understanding applications. These models can be built by training or fine-tuning deep neural network architectures, which yields state-of-the-art results; high-quality annotated datasets are therefore essential for building state-of-the-art models. We propose a method to build a Vietnamese dataset for training Vietnamese inference models that work on native Vietnamese texts. Our approach addresses two issues: removing cue marks and preserving the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models can identify the relationship between a premise and a hypothesis without performing semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on the XNLI dataset. When tested on our Vietnamese test set, viNLI reaches an accuracy of 94.79%, while viXNLI reaches 64.04%. We also conducted an answer-selection experiment with these two models, in which the P@1 scores of viNLI and viXNLI are 0.4949 and 0.4044, respectively. These results indicate that our method can be used to build a high-quality Vietnamese natural language inference dataset.
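The abstract's central methodological point is the removal of cue marks: surface tokens in a hypothesis that leak the gold label, letting a model classify pairs without ever reading the premise. A purely illustrative sketch of how such leakage shows up (the toy pairs and cue-word list below are invented for illustration, not taken from the paper's Vietnamese dataset):

```python
# Toy demonstration of cue-mark leakage in an NLI dataset.
# The pairs and cue words below are hypothetical; they are not
# from the paper's dataset.

CUE_WORDS = {"no", "not", "never"}  # assumed contradiction cues

def hypothesis_only_predict(hypothesis: str) -> str:
    """Label a premise-hypothesis pair WITHOUT looking at the premise."""
    tokens = set(hypothesis.lower().replace(".", "").split())
    return "contradiction" if tokens & CUE_WORDS else "entailment"

# (premise, hypothesis, gold label) -- the hypotheses leak the label.
pairs = [
    ("The cat sleeps on the mat.", "An animal is sleeping.", "entailment"),
    ("The cat sleeps on the mat.", "No animal is sleeping.", "contradiction"),
    ("Two men play chess.", "People are playing a game.", "entailment"),
    ("Two men play chess.", "The men are not playing.", "contradiction"),
]

# The premise is ignored entirely, yet accuracy is perfect -- a sign
# that the dataset, not semantic computation, is doing the work.
accuracy = sum(hypothesis_only_predict(h) == y for _, h, y in pairs) / len(pairs)
print(accuracy)  # 1.0 on this leaky toy set
```

A dataset built with the paper's approach should drive such a hypothesis-only baseline back toward chance accuracy.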

Bibliographic Details
Main Authors: Nguyen, Chinh Trong, Nguyen, Dang Tuan
Format: Online Article Text
Language: English
Published: Springer Nature Singapore 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9311348/
https://www.ncbi.nlm.nih.gov/pubmed/35911435
http://dx.doi.org/10.1007/s42979-022-01267-x
_version_ 1784753580531515392
author Nguyen, Chinh Trong
Nguyen, Dang Tuan
author_facet Nguyen, Chinh Trong
Nguyen, Dang Tuan
author_sort Nguyen, Chinh Trong
collection PubMed
description Natural language inference models are essential resources for many natural language understanding applications. These models can be built by training or fine-tuning deep neural network architectures, which yields state-of-the-art results; high-quality annotated datasets are therefore essential for building state-of-the-art models. We propose a method to build a Vietnamese dataset for training Vietnamese inference models that work on native Vietnamese texts. Our approach addresses two issues: removing cue marks and preserving the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models can identify the relationship between a premise and a hypothesis without performing semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on the XNLI dataset. When tested on our Vietnamese test set, viNLI reaches an accuracy of 94.79%, while viXNLI reaches 64.04%. We also conducted an answer-selection experiment with these two models, in which the P@1 scores of viNLI and viXNLI are 0.4949 and 0.4044, respectively. These results indicate that our method can be used to build a high-quality Vietnamese natural language inference dataset.
format Online
Article
Text
id pubmed-9311348
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Springer Nature Singapore
record_format MEDLINE/PubMed
spelling pubmed-93113482022-07-26 Building a Vietnamese Dataset for Natural Language Inference Models Nguyen, Chinh Trong Nguyen, Dang Tuan SN Comput Sci Original Research Springer Nature Singapore 2022-07-25 2022 /pmc/articles/PMC9311348/ /pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x Text en © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Original Research
Nguyen, Chinh Trong
Nguyen, Dang Tuan
Building a Vietnamese Dataset for Natural Language Inference Models
title Building a Vietnamese Dataset for Natural Language Inference Models
title_full Building a Vietnamese Dataset for Natural Language Inference Models
title_fullStr Building a Vietnamese Dataset for Natural Language Inference Models
title_full_unstemmed Building a Vietnamese Dataset for Natural Language Inference Models
title_short Building a Vietnamese Dataset for Natural Language Inference Models
title_sort building a vietnamese dataset for natural language inference models
topic Original Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9311348/
https://www.ncbi.nlm.nih.gov/pubmed/35911435
http://dx.doi.org/10.1007/s42979-022-01267-x
work_keys_str_mv AT nguyenchinhtrong buildingavietnamesedatasetfornaturallanguageinferencemodels
AT nguyendangtuan buildingavietnamesedatasetfornaturallanguageinferencemodels
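The answer-selection experiment described in the record's abstract reports P@1 (precision at 1): the fraction of questions for which the model's top-ranked candidate answer is correct. A minimal sketch of that metric under its usual definition (the scores and labels below are made up for illustration; in the paper's setup the score would presumably be the entailment score the fine-tuned model assigns to each question-candidate pair):

```python
# Precision-at-1 for answer selection: a question counts as a hit when
# the candidate with the highest model score is a correct answer.
# Scores and correctness flags here are illustrative only.

def precision_at_1(questions):
    """questions: list of candidate lists; each candidate is (score, is_correct)."""
    hits = 0
    for candidates in questions:
        best = max(candidates, key=lambda c: c[0])  # top-scored candidate
        hits += best[1]  # True counts as 1
    return hits / len(questions)

# Two questions: the top candidate is correct only for the first one.
example = [
    [(0.90, True), (0.40, False), (0.20, False)],
    [(0.70, False), (0.60, True)],
]
print(precision_at_1(example))  # 0.5
```

Any real-valued ranking score works here; the metric only depends on which candidate is ranked first per question.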