Building a Vietnamese Dataset for Natural Language Inference Models
Natural language inference models are essential resources for many natural language understanding applications. These models are possibly built by training or fine-tuning using deep neural network architectures for state-of-the-art results. That means high-quality annotated datasets are essential for building state-of-the-art models. Therefore, we propose a method to build a Vietnamese dataset for training Vietnamese inference models which work on native Vietnamese texts. Our approach aims at two issues: removing cue marks and ensuring the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models will identify the relationship between a premise and a hypothesis without semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on XNLI dataset. The viNLI model has an accuracy of 94.79%, while the viXNLI model has an accuracy of 64.04% when testing on our Vietnamese test set. In addition, we also conducted an answer selection experiment with these two models in which the P@1 of viNLI and of viXNLI are 0.4949 and 0.4044, respectively. That means our method can be used to build a high-quality Vietnamese natural language inference dataset.
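The fine-tuning step described in the abstract follows the standard sequence-pair classification recipe for BERT: premise and hypothesis are paired into one input and classified into three NLI labels. Below is a minimal sketch of that recipe using Hugging Face `transformers`; the checkpoint name, file names, and column names are illustrative assumptions, not details taken from the paper.

```python
# Sketch of BERT fine-tuning for 3-way NLI (entailment / neutral / contradiction).
# Checkpoint, file paths, and column names are assumptions for illustration only.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # assumed; the record only says "a BERT model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Hypothetical CSV files with "premise", "hypothesis", and integer "label" (0/1/2) columns.
data = load_dataset("csv", data_files={"train": "vi_nli_train.csv",
                                       "test": "vi_nli_test.csv"})

def encode(batch):
    # Premise and hypothesis are joined into one sequence pair, as in standard NLI fine-tuning.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=256)

data = data.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vinli-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a compute_metrics fn for accuracy
```

This is only a sketch of the general technique; the paper's actual hyperparameters, pretrained checkpoint, and evaluation code are not given in this record.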
Main Authors: | Nguyen, Chinh Trong; Nguyen, Dang Tuan
---|---|
Format: | Online Article Text
Language: | English
Published: | Springer Nature Singapore, 2022
Subjects: |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9311348/ https://www.ncbi.nlm.nih.gov/pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x
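The answer-selection experiment in the abstract is scored with P@1: a question counts as a hit when the candidate answer ranked first by the model is a correct one. A small sketch of that measure, with hypothetical names (`precision_at_1`, `score`) standing in for the authors' actual ranking setup:

```python
# Sketch of P@1 (precision at 1) for answer selection: rank each question's
# candidates by a model score and count questions whose top candidate is correct.
from typing import Callable, List, Tuple

def precision_at_1(
    questions: List[Tuple[str, List[str], List[bool]]],
    score: Callable[[str, str], float],
) -> float:
    """questions: (question text, candidate answers, per-candidate correctness flags)."""
    hits = 0
    for question, candidates, is_correct in questions:
        # Index of the candidate with the highest (question, candidate) score.
        top = max(range(len(candidates)), key=lambda i: score(question, candidates[i]))
        hits += int(is_correct[top])
    return hits / len(questions)
```

In an NLI-based setup, `score` would typically return the model's entailment probability for the question-answer pair, but the record does not specify how the authors derived their ranking scores.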
_version_ | 1784753580531515392 |
---|---|
author | Nguyen, Chinh Trong Nguyen, Dang Tuan |
author_facet | Nguyen, Chinh Trong Nguyen, Dang Tuan |
author_sort | Nguyen, Chinh Trong |
collection | PubMed |
description | Natural language inference models are essential resources for many natural language understanding applications. These models are possibly built by training or fine-tuning using deep neural network architectures for state-of-the-art results. That means high-quality annotated datasets are essential for building state-of-the-art models. Therefore, we propose a method to build a Vietnamese dataset for training Vietnamese inference models which work on native Vietnamese texts. Our approach aims at two issues: removing cue marks and ensuring the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models will identify the relationship between a premise and a hypothesis without semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on XNLI dataset. The viNLI model has an accuracy of 94.79%, while the viXNLI model has an accuracy of 64.04% when testing on our Vietnamese test set. In addition, we also conducted an answer selection experiment with these two models in which the P@1 of viNLI and of viXNLI are 0.4949 and 0.4044, respectively. That means our method can be used to build a high-quality Vietnamese natural language inference dataset. |
format | Online Article Text |
id | pubmed-9311348 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer Nature Singapore |
record_format | MEDLINE/PubMed |
spelling | pubmed-93113482022-07-26 Building a Vietnamese Dataset for Natural Language Inference Models Nguyen, Chinh Trong Nguyen, Dang Tuan SN Comput Sci Original Research Natural language inference models are essential resources for many natural language understanding applications. These models are possibly built by training or fine-tuning using deep neural network architectures for state-of-the-art results. That means high-quality annotated datasets are essential for building state-of-the-art models. Therefore, we propose a method to build a Vietnamese dataset for training Vietnamese inference models which work on native Vietnamese texts. Our approach aims at two issues: removing cue marks and ensuring the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models will identify the relationship between a premise and a hypothesis without semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on XNLI dataset. The viNLI model has an accuracy of 94.79%, while the viXNLI model has an accuracy of 64.04% when testing on our Vietnamese test set. In addition, we also conducted an answer selection experiment with these two models in which the P@1 of viNLI and of viXNLI are 0.4949 and 0.4044, respectively. That means our method can be used to build a high-quality Vietnamese natural language inference dataset. Springer Nature Singapore 2022-07-25 2022 /pmc/articles/PMC9311348/ /pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x Text en © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Original Research Nguyen, Chinh Trong Nguyen, Dang Tuan Building a Vietnamese Dataset for Natural Language Inference Models |
title | Building a Vietnamese Dataset for Natural Language Inference Models |
title_full | Building a Vietnamese Dataset for Natural Language Inference Models |
title_fullStr | Building a Vietnamese Dataset for Natural Language Inference Models |
title_full_unstemmed | Building a Vietnamese Dataset for Natural Language Inference Models |
title_short | Building a Vietnamese Dataset for Natural Language Inference Models |
title_sort | building a vietnamese dataset for natural language inference models |
topic | Original Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9311348/ https://www.ncbi.nlm.nih.gov/pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x |
work_keys_str_mv | AT nguyenchinhtrong buildingavietnamesedatasetfornaturallanguageinferencemodels AT nguyendangtuan buildingavietnamesedatasetfornaturallanguageinferencemodels |