
Building a Vietnamese Dataset for Natural Language Inference Models


Bibliographic Details
Main Authors:	Nguyen, Chinh Trong; Nguyen, Dang Tuan
Format:	Online Article Text
Language:	English
Published:	Springer Nature Singapore 2022
Subjects:	Original Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9311348/ https://www.ncbi.nlm.nih.gov/pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x

Description
Summary:	Natural language inference models are essential resources for many natural language understanding applications. These models are typically built by training or fine-tuning deep neural network architectures to achieve state-of-the-art results, so high-quality annotated datasets are essential for building them. We therefore propose a method to build a Vietnamese dataset for training Vietnamese inference models that work on native Vietnamese texts. Our approach addresses two issues: removing cue marks and preserving the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models will identify the relationship between a premise and a hypothesis without semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on the XNLI dataset. When tested on our Vietnamese test set, the viNLI model has an accuracy of 94.79%, while the viXNLI model has an accuracy of 64.04%. In addition, we conducted an answer selection experiment with these two models, in which the P@1 of viNLI and of viXNLI are 0.4949 and 0.4044, respectively. These results indicate that our method can be used to build a high-quality Vietnamese natural language inference dataset.
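
Note on P@1: the summary reports P@1 for the answer selection experiment but does not describe how it was computed. The sketch below shows the usual definition (the fraction of questions whose top-ranked candidate is a correct answer). It is not the authors' evaluation code. The function name precision_at_1 and the (score, is_correct) candidate format are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical input format: for each question, a list of candidate answers,
# each given as (model_score, is_correct). This format is an assumption,
# not something specified in the record above.
Candidate = Tuple[float, bool]


def precision_at_1(questions: List[List[Candidate]]) -> float:
    """Return the fraction of questions whose highest-scored candidate is correct."""
    if not questions:
        raise ValueError("at least one question is required")
    hits = 0
    for candidates in questions:
        if not candidates:
            continue  # a question with no candidates counts as a miss
        top_score, top_is_correct = max(candidates, key=lambda c: c[0])
        hits += int(top_is_correct)
    return hits / len(questions)
```

For example, with two questions where only the first has a correct answer ranked highest, the function returns 0.5.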
