
Large-scale Vietnamese point-of-interest classification using weak labeling

Points of interest (POIs) represent geographic locations by category (e.g., tourist attractions, amenities, or shops) and play a prominent role in several location-based applications. However, most POI category labels are crowd-sourced by the community and are thus often of low quality....


Bibliographic Details
Main authors: Tran, Van Trung; Le, Quang Dao; Pham, Bao Son; Luu, Viet Hung; Bui, Quang Hung
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Artificial Intelligence
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9780588/
https://www.ncbi.nlm.nih.gov/pubmed/36568578
http://dx.doi.org/10.3389/frai.2022.1020532
_version_ 1784856868637638656
author Tran, Van Trung
Le, Quang Dao
Pham, Bao Son
Luu, Viet Hung
Bui, Quang Hung
author_facet Tran, Van Trung
Le, Quang Dao
Pham, Bao Son
Luu, Viet Hung
Bui, Quang Hung
author_sort Tran, Van Trung
collection PubMed
description Points of interest (POIs) represent geographic locations by category (e.g., tourist attractions, amenities, or shops) and play a prominent role in several location-based applications. However, most POI category labels are crowd-sourced by the community and are thus often of low quality. In this paper, we introduce the first annotated dataset for the POI category classification task in Vietnamese. A total of 750,000 POIs were collected from WeMap, a Vietnamese digital map. Because large-scale hand-labeling is inherently time-consuming and labor-intensive, we propose a new approach using weak labeling. As a result, our dataset covers 15 categories with 275,000 weakly labeled POIs for training and 30,000 gold-standard POIs for testing, making it the largest among existing Vietnamese POI datasets. We conduct POI category classification experiments using a strong baseline (BERT-based fine-tuning) on our dataset and find that our approach is efficient and applicable at large scale. The proposed baseline achieves an F1 score of 90% on the test dataset and improves the accuracy of WeMap POI data by 37 percentage points (from 56% to 93%).
format Online
Article
Text
id pubmed-9780588
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9780588 2022-12-24 Large-scale Vietnamese point-of-interest classification using weak labeling Tran, Van Trung Le, Quang Dao Pham, Bao Son Luu, Viet Hung Bui, Quang Hung Front Artif Intell Artificial Intelligence Points of interest (POIs) represent geographic locations by category (e.g., tourist attractions, amenities, or shops) and play a prominent role in several location-based applications. However, most POI category labels are crowd-sourced by the community and are thus often of low quality. In this paper, we introduce the first annotated dataset for the POI category classification task in Vietnamese. A total of 750,000 POIs were collected from WeMap, a Vietnamese digital map. Because large-scale hand-labeling is inherently time-consuming and labor-intensive, we propose a new approach using weak labeling. As a result, our dataset covers 15 categories with 275,000 weakly labeled POIs for training and 30,000 gold-standard POIs for testing, making it the largest among existing Vietnamese POI datasets. We conduct POI category classification experiments using a strong baseline (BERT-based fine-tuning) on our dataset and find that our approach is efficient and applicable at large scale. The proposed baseline achieves an F1 score of 90% on the test dataset and improves the accuracy of WeMap POI data by 37 percentage points (from 56% to 93%). Frontiers Media S.A. 2022-12-09 /pmc/articles/PMC9780588/ /pubmed/36568578 http://dx.doi.org/10.3389/frai.2022.1020532 Text en Copyright © 2022 Tran, Le, Pham, Luu and Bui. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY).
The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Artificial Intelligence
Tran, Van Trung
Le, Quang Dao
Pham, Bao Son
Luu, Viet Hung
Bui, Quang Hung
Large-scale Vietnamese point-of-interest classification using weak labeling
title Large-scale Vietnamese point-of-interest classification using weak labeling
title_full Large-scale Vietnamese point-of-interest classification using weak labeling
title_fullStr Large-scale Vietnamese point-of-interest classification using weak labeling
title_full_unstemmed Large-scale Vietnamese point-of-interest classification using weak labeling
title_short Large-scale Vietnamese point-of-interest classification using weak labeling
title_sort large-scale vietnamese point-of-interest classification using weak labeling
topic Artificial Intelligence
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9780588/
https://www.ncbi.nlm.nih.gov/pubmed/36568578
http://dx.doi.org/10.3389/frai.2022.1020532