
Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation

With recent advances in computer vision and natural language processing, there has been growing interest in multimodal tasks that require concurrently understanding several forms of input data, such as images and text. Vision-and-language navigation (VLN) requires aligning and grounding multimodal input data to perceive the task status in real time from panoramic images and a natural language instruction. This study proposes JMEBS, a novel deep neural network model with joint multimodal embedding and backtracking search for VLN tasks. JMEBS uses a transformer-based joint multimodal embedding module that exploits both multimodal and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm that improves the task success rate and optimizes the navigation path by backtracking based on the local and global scores of candidate actions. A novel global scoring method further improves performance by comparing the partial trajectories searched so far with multiple natural language instructions. The performance of the proposed model was experimentally demonstrated and compared with other models using the Matterport3D Simulator and the room-to-room (R2R) benchmark dataset.
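
The BGLS procedure described in the abstract can be pictured as a best-first search over partial trajectories: at every step the agent greedily expands the highest-scoring open candidate, and selecting a candidate that extends an earlier partial trajectory is, in effect, a backtrack. The sketch below is a minimal, hypothetical Python illustration of that idea only; the environment interface (env.start, env.is_goal, env.candidate_actions, env.peek) and the scoring callables (local_score, global_score) are placeholder assumptions, not the authors' implementation.

import heapq
import itertools

def bgls(env, instruction, local_score, global_score, max_steps=40):
    # Hypothetical sketch of backtracking-enabled greedy local search (BGLS).
    # All env.* methods and both score functions are assumed interfaces.
    frontier = []                 # max-heap of open candidates (scores negated)
    tiebreak = itertools.count()  # keeps heap entries comparable on score ties
    trajectory = [env.start()]
    for _ in range(max_steps):
        node = trajectory[-1]
        if env.is_goal(node):
            return trajectory
        # Score each candidate action by combining the step-wise (local)
        # score with a global score that compares the extended partial
        # trajectory against the instruction.
        for action in env.candidate_actions(node):
            extended = trajectory + [env.peek(node, action)]
            score = local_score(trajectory, action) + global_score(extended, instruction)
            heapq.heappush(frontier, (-score, next(tiebreak), extended))
        if not frontier:
            break
        # Greedy choice over ALL open candidates, not only the current
        # node's children: picking one that hangs off an earlier partial
        # trajectory is where backtracking happens.
        _, _, trajectory = heapq.heappop(frontier)
    return trajectory

Keeping every unexpanded candidate in a single global frontier is what lets the agent abandon a dead-end branch without restarting, which is the behavior the abstract credits for the improved success rate and optimized navigation path.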

Bibliographic Details
Main Authors: Hwang, Jisu; Kim, Incheol
Format: Online, Article, Text
Language: English
Published: MDPI, 2021
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7867342/
https://www.ncbi.nlm.nih.gov/pubmed/33540789
http://dx.doi.org/10.3390/s21031012
collection PubMed
id pubmed-7867342
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
journal Sensors (Basel)
publishDate 2021-02-02
license © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).