
Outdoor Vision-and-Language Navigation Needs Object-Level Alignment


Bibliographic Details
Main Authors: Sun, Yanjun, Qiu, Yue, Aoki, Yoshimitsu, Kataoka, Hirokatsu
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10346337/
https://www.ncbi.nlm.nih.gov/pubmed/37447877
http://dx.doi.org/10.3390/s23136028
_version_ 1785073291604525056
author Sun, Yanjun
Qiu, Yue
Aoki, Yoshimitsu
Kataoka, Hirokatsu
author_facet Sun, Yanjun
Qiu, Yue
Aoki, Yoshimitsu
Kataoka, Hirokatsu
author_sort Sun, Yanjun
collection PubMed
description In the field of embodied AI, vision-and-language navigation (VLN) is a crucial and challenging multi-modal task. Specifically, outdoor VLN involves an agent navigating within a graph-based environment while simultaneously interpreting information from real-world urban environments and natural language instructions. Existing outdoor VLN models predict actions using a combination of panorama and instruction features. However, these methods may cause the agent to struggle to understand complicated outdoor environments, overlook details in the environment, and consequently fail to navigate. Human navigation often relies on specific objects as reference landmarks when traveling to unfamiliar places, providing a more rational and efficient approach to navigation. Inspired by this natural human behavior, we propose an object-level alignment module (OAlM), which guides the agent to focus more on object tokens mentioned in the instructions and to recognize these landmarks during navigation. By treating these landmarks as sub-goals, our method effectively decomposes a long-range path into a series of shorter paths, ultimately improving the agent’s overall performance. In addition to enabling better object recognition and alignment, our proposed OAlM also fosters a more robust and adaptable agent capable of navigating complex environments. This adaptability is particularly crucial for real-world applications, where environmental conditions can be unpredictable and varied. Experimental results show that OAlM is a more object-focused model and that our approach outperforms the baseline on all metrics on the challenging outdoor VLN Touchdown dataset, exceeding it by 3.19% on task completion (TC). These results highlight the potential of leveraging object-level information in the form of sub-goals to improve navigation performance in embodied AI systems, paving the way for more advanced and efficient outdoor navigation.
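
To make the object-level alignment idea in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' published implementation): it scores instruction object tokens against detected visual object features by cosine similarity and treats the first sufficiently matched landmark as the current sub-goal. All function names, embedding shapes, and the matching threshold are illustrative assumptions.

```python
# Hypothetical sketch of object-level alignment between instruction object
# tokens and visual object features; names and shapes are illustrative only.
import torch
import torch.nn.functional as F

def object_alignment_scores(object_token_emb: torch.Tensor,
                            visual_object_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each instruction object token (T, D)
    and each detected visual object (O, D); returns a (T, O) score grid."""
    t = F.normalize(object_token_emb, dim=-1)
    v = F.normalize(visual_object_emb, dim=-1)
    return t @ v.T

def next_subgoal(scores: torch.Tensor, threshold: float = 0.5):
    """Pick the first instruction landmark whose best visual match clears
    the threshold; using it as the current sub-goal decomposes a long
    path into a series of shorter ones, as the abstract describes."""
    best_match, _ = scores.max(dim=1)  # best visual match per token
    for idx, score in enumerate(best_match):
        if score.item() >= threshold:
            return idx
    return None  # no mentioned landmark is visible yet

# Toy usage with random embeddings standing in for real encoders.
torch.manual_seed(0)
token_emb = torch.randn(3, 256)   # e.g. "traffic light", "bakery", "bench"
object_emb = torch.randn(5, 256)  # objects detected in the current panorama
subgoal = next_subgoal(object_alignment_scores(token_emb, object_emb))
print(f"current sub-goal token index: {subgoal}")
```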
format Online
Article
Text
id pubmed-10346337
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10346337 2023-07-15 Outdoor Vision-and-Language Navigation Needs Object-Level Alignment Sun, Yanjun Qiu, Yue Aoki, Yoshimitsu Kataoka, Hirokatsu Sensors (Basel) Article In the field of embodied AI, vision-and-language navigation (VLN) is a crucial and challenging multi-modal task. Specifically, outdoor VLN involves an agent navigating within a graph-based environment while simultaneously interpreting information from real-world urban environments and natural language instructions. Existing outdoor VLN models predict actions using a combination of panorama and instruction features. However, these methods may cause the agent to struggle to understand complicated outdoor environments, overlook details in the environment, and consequently fail to navigate. Human navigation often relies on specific objects as reference landmarks when traveling to unfamiliar places, providing a more rational and efficient approach to navigation. Inspired by this natural human behavior, we propose an object-level alignment module (OAlM), which guides the agent to focus more on object tokens mentioned in the instructions and to recognize these landmarks during navigation. By treating these landmarks as sub-goals, our method effectively decomposes a long-range path into a series of shorter paths, ultimately improving the agent’s overall performance. In addition to enabling better object recognition and alignment, our proposed OAlM also fosters a more robust and adaptable agent capable of navigating complex environments. This adaptability is particularly crucial for real-world applications, where environmental conditions can be unpredictable and varied. Experimental results show that OAlM is a more object-focused model and that our approach outperforms the baseline on all metrics on the challenging outdoor VLN Touchdown dataset, exceeding it by 3.19% on task completion (TC). These results highlight the potential of leveraging object-level information in the form of sub-goals to improve navigation performance in embodied AI systems, paving the way for more advanced and efficient outdoor navigation. MDPI 2023-06-29 /pmc/articles/PMC10346337/ /pubmed/37447877 http://dx.doi.org/10.3390/s23136028 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Sun, Yanjun
Qiu, Yue
Aoki, Yoshimitsu
Kataoka, Hirokatsu
Outdoor Vision-and-Language Navigation Needs Object-Level Alignment
title Outdoor Vision-and-Language Navigation Needs Object-Level Alignment
title_full Outdoor Vision-and-Language Navigation Needs Object-Level Alignment
title_fullStr Outdoor Vision-and-Language Navigation Needs Object-Level Alignment
title_full_unstemmed Outdoor Vision-and-Language Navigation Needs Object-Level Alignment
title_short Outdoor Vision-and-Language Navigation Needs Object-Level Alignment
title_sort outdoor vision-and-language navigation needs object-level alignment
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10346337/
https://www.ncbi.nlm.nih.gov/pubmed/37447877
http://dx.doi.org/10.3390/s23136028
work_keys_str_mv AT sunyanjun outdoorvisionandlanguagenavigationneedsobjectlevelalignment
AT qiuyue outdoorvisionandlanguagenavigationneedsobjectlevelalignment
AT aokiyoshimitsu outdoorvisionandlanguagenavigationneedsobjectlevelalignment
AT kataokahirokatsu outdoorvisionandlanguagenavigationneedsobjectlevelalignment