Building a Vietnamese Dataset for Natural Language Inference Models
Natural language inference models are essential resources for many natural language understanding applications. These models are possibly built by training or fine-tuning using deep neural network architectures for state-of-the-art results. That means high-quality annotated datasets are essential for building state-of-the-art models. Therefore, we propose a method to build a Vietnamese dataset for training Vietnamese inference models which work on native Vietnamese texts. Our approach aims at two issues: removing cue marks and ensuring the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models will identify the relationship between a premise and a hypothesis without semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on XNLI dataset. The viNLI model has an accuracy of 94.79%, while the viXNLI model has an accuracy of 64.04% when testing on our Vietnamese test set. In addition, we also conducted an answer selection experiment with these two models in which the P@1 of viNLI and of viXNLI are 0.4949 and 0.4044, respectively. That means our method can be used to build a high-quality Vietnamese natural language inference dataset.
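The fine-tuning step described in the abstract follows the standard sequence-pair classification recipe for BERT: premise and hypothesis are paired into one input and classified into three NLI labels. Below is a minimal sketch of that recipe using Hugging Face `transformers`; the checkpoint name, file names, and column names are illustrative assumptions, not details taken from the paper.

```python
# Sketch of BERT fine-tuning for 3-way NLI (entailment / neutral / contradiction).
# Checkpoint, file paths, and column names are assumptions for illustration only.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # assumed; the record only says "a BERT model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Hypothetical CSV files with "premise", "hypothesis", and integer "label" (0/1/2) columns.
data = load_dataset("csv", data_files={"train": "vi_nli_train.csv",
                                       "test": "vi_nli_test.csv"})

def encode(batch):
    # Premise and hypothesis are joined into one sequence pair, as in standard NLI fine-tuning.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=256)

data = data.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vinli-bert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add a compute_metrics fn for accuracy
```

This is only a sketch of the general technique; the paper's actual hyperparameters, pretrained checkpoint, and evaluation code are not given in this record.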
Main Authors: | Nguyen, Chinh Trong; Nguyen, Dang Tuan
---|---|
Format: | Online Article Text
Language: | English
Published: | Springer Nature Singapore, 2022
Subjects: |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9311348/ https://www.ncbi.nlm.nih.gov/pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x
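The answer-selection experiment in the abstract is scored with P@1: a question counts as a hit when the candidate answer ranked first by the model is a correct one. A small sketch of that measure, with hypothetical names (`precision_at_1`, `score`) standing in for the authors' actual ranking setup:

```python
# Sketch of P@1 (precision at 1) for answer selection: rank each question's
# candidates by a model score and count questions whose top candidate is correct.
from typing import Callable, List, Tuple

def precision_at_1(
    questions: List[Tuple[str, List[str], List[bool]]],
    score: Callable[[str, str], float],
) -> float:
    """questions: (question text, candidate answers, per-candidate correctness flags)."""
    hits = 0
    for question, candidates, is_correct in questions:
        # Index of the candidate with the highest (question, candidate) score.
        top = max(range(len(candidates)), key=lambda i: score(question, candidates[i]))
        hits += int(is_correct[top])
    return hits / len(questions)
```

In an NLI-based setup, `score` would typically return the model's entailment probability for the question-answer pair, but the record does not specify how the authors derived their ranking scores.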
_version_ | 1784753580531515392 |
---|---|
author | Nguyen, Chinh Trong Nguyen, Dang Tuan |
author_facet | Nguyen, Chinh Trong Nguyen, Dang Tuan |
author_sort | Nguyen, Chinh Trong |
collection | PubMed |
description | Natural language inference models are essential resources for many natural language understanding applications. These models are possibly built by training or fine-tuning using deep neural network architectures for state-of-the-art results. That means high-quality annotated datasets are essential for building state-of-the-art models. Therefore, we propose a method to build a Vietnamese dataset for training Vietnamese inference models which work on native Vietnamese texts. Our approach aims at two issues: removing cue marks and ensuring the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models will identify the relationship between a premise and a hypothesis without semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on XNLI dataset. The viNLI model has an accuracy of 94.79%, while the viXNLI model has an accuracy of 64.04% when testing on our Vietnamese test set. In addition, we also conducted an answer selection experiment with these two models in which the P@1 of viNLI and of viXNLI are 0.4949 and 0.4044, respectively. That means our method can be used to build a high-quality Vietnamese natural language inference dataset. |
format | Online Article Text |
id | pubmed-9311348 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Springer Nature Singapore |
record_format | MEDLINE/PubMed |
spelling | pubmed-93113482022-07-26 Building a Vietnamese Dataset for Natural Language Inference Models Nguyen, Chinh Trong Nguyen, Dang Tuan SN Comput Sci Original Research Natural language inference models are essential resources for many natural language understanding applications. These models are possibly built by training or fine-tuning using deep neural network architectures for state-of-the-art results. That means high-quality annotated datasets are essential for building state-of-the-art models. Therefore, we propose a method to build a Vietnamese dataset for training Vietnamese inference models which work on native Vietnamese texts. Our approach aims at two issues: removing cue marks and ensuring the writing style of Vietnamese texts. If a dataset contains cue marks, the trained models will identify the relationship between a premise and a hypothesis without semantic computation. For evaluation, we fine-tuned a BERT model, viNLI, on our dataset and compared it to a BERT model, viXNLI, which was fine-tuned on XNLI dataset. The viNLI model has an accuracy of 94.79%, while the viXNLI model has an accuracy of 64.04% when testing on our Vietnamese test set. In addition, we also conducted an answer selection experiment with these two models in which the P@1 of viNLI and of viXNLI are 0.4949 and 0.4044, respectively. That means our method can be used to build a high-quality Vietnamese natural language inference dataset. Springer Nature Singapore 2022-07-25 2022 /pmc/articles/PMC9311348/ /pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x Text en © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd 2022 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Original Research Nguyen, Chinh Trong Nguyen, Dang Tuan Building a Vietnamese Dataset for Natural Language Inference Models |
title | Building a Vietnamese Dataset for Natural Language Inference Models |
title_full | Building a Vietnamese Dataset for Natural Language Inference Models |
title_fullStr | Building a Vietnamese Dataset for Natural Language Inference Models |
title_full_unstemmed | Building a Vietnamese Dataset for Natural Language Inference Models |
title_short | Building a Vietnamese Dataset for Natural Language Inference Models |
title_sort | building a vietnamese dataset for natural language inference models |
topic | Original Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9311348/ https://www.ncbi.nlm.nih.gov/pubmed/35911435 http://dx.doi.org/10.1007/s42979-022-01267-x |
work_keys_str_mv | AT nguyenchinhtrong buildingavietnamesedatasetfornaturallanguageinferencemodels AT nguyendangtuan buildingavietnamesedatasetfornaturallanguageinferencemodels |