Large-scale Vietnamese point-of-interest classification using weak labeling
Point-of-Interests (POIs) represent geographic location by different categories (e.g., touristic places, amenities, or shops) and play a prominent role in several location-based applications. However, the majority of POIs category labels are crowd-sourced by the community, thus often of low quality....
Main Authors: | Tran, Van Trung; Le, Quang Dao; Pham, Bao Son; Luu, Viet Hung; Bui, Quang Hung |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Artificial Intelligence |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9780588/ https://www.ncbi.nlm.nih.gov/pubmed/36568578 http://dx.doi.org/10.3389/frai.2022.1020532 |
_version_ | 1784856868637638656 |
---|---|
author | Tran, Van Trung Le, Quang Dao Pham, Bao Son Luu, Viet Hung Bui, Quang Hung |
author_sort | Tran, Van Trung |
collection | PubMed |
description | Point-of-Interests (POIs) represent geographic location by different categories (e.g., touristic places, amenities, or shops) and play a prominent role in several location-based applications. However, the majority of POIs category labels are crowd-sourced by the community, thus often of low quality. In this paper, we introduce the first annotated dataset for the POIs categorical classification task in Vietnamese. A total of 750,000 POIs are collected from WeMap, a Vietnamese digital map. Large-scale hand-labeling is inherently time-consuming and labor-intensive, thus we have proposed a new approach using weak labeling. As a result, our dataset covers 15 categories with 275,000 weak-labeled POIs for training, and 30,000 gold-standard POIs for testing, making it the largest compared to the existing Vietnamese POIs dataset. We empirically conduct POI categorical classification experiments using a strong baseline (BERT-based fine-tuning) on our dataset and find that our approach shows high efficiency and is applicable on a large scale. The proposed baseline gives an F1 score of 90% on the test dataset, and significantly improves the accuracy of WeMap POI data by a margin of 37% (from 56 to 93%). |
format | Online Article Text |
id | pubmed-9780588 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-97805882022-12-24 Large-scale Vietnamese point-of-interest classification using weak labeling Tran, Van Trung Le, Quang Dao Pham, Bao Son Luu, Viet Hung Bui, Quang Hung Front Artif Intell Artificial Intelligence Point-of-Interests (POIs) represent geographic location by different categories (e.g., touristic places, amenities, or shops) and play a prominent role in several location-based applications. However, the majority of POIs category labels are crowd-sourced by the community, thus often of low quality. In this paper, we introduce the first annotated dataset for the POIs categorical classification task in Vietnamese. A total of 750,000 POIs are collected from WeMap, a Vietnamese digital map. Large-scale hand-labeling is inherently time-consuming and labor-intensive, thus we have proposed a new approach using weak labeling. As a result, our dataset covers 15 categories with 275,000 weak-labeled POIs for training, and 30,000 gold-standard POIs for testing, making it the largest compared to the existing Vietnamese POIs dataset. We empirically conduct POI categorical classification experiments using a strong baseline (BERT-based fine-tuning) on our dataset and find that our approach shows high efficiency and is applicable on a large scale. The proposed baseline gives an F1 score of 90% on the test dataset, and significantly improves the accuracy of WeMap POI data by a margin of 37% (from 56 to 93%). Frontiers Media S.A. 2022-12-09 /pmc/articles/PMC9780588/ /pubmed/36568578 http://dx.doi.org/10.3389/frai.2022.1020532 Text en Copyright © 2022 Tran, Le, Pham, Luu and Bui. https://creativecommons.org/licenses/by/4.0/This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). 
The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
title | Large-scale Vietnamese point-of-interest classification using weak labeling |
topic | Artificial Intelligence |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9780588/ https://www.ncbi.nlm.nih.gov/pubmed/36568578 http://dx.doi.org/10.3389/frai.2022.1020532 |