Deep Modular Bilinear Attention Network for Visual Question Answering
VQA (Visual Question Answering) is a multi-modal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate the intra-modali...
| Main Authors: | Yan, Feng; Silamu, Wushouer; Li, Yanbing |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | MDPI, 2022 |
| Subjects: | Article |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8838230/ https://www.ncbi.nlm.nih.gov/pubmed/35161790 http://dx.doi.org/10.3390/s22031045 |
_version_ | 1784650074924515328 |
author | Yan, Feng; Silamu, Wushouer; Li, Yanbing
author_facet | Yan, Feng; Silamu, Wushouer; Li, Yanbing
author_sort | Yan, Feng |
collection | PubMed |
description | VQA (Visual Question Answering) is a multi-modal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, the BAN (Bilinear Attention Network) method is used to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We then sum these question features with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, the accuracy of our model reaches 70.85% on the test-std set of VQA 2.0.
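The abstract presents bilinear attention units (BAN-GA and BAN-SA) as the core of DMBA-NET. For readers unfamiliar with bilinear attention, the following is a minimal sketch of a single-glimpse bilinear attention unit in the general style of BAN; the class name, hidden size, and wiring are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttentionSketch(nn.Module):
    """Minimal single-glimpse bilinear attention unit in the spirit of BAN.

    The hidden size `k` and the overall wiring are illustrative assumptions,
    not the exact configuration used in DMBA-NET.
    """
    def __init__(self, dim_q: int, dim_v: int, k: int = 512):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, k)      # project question tokens
        self.proj_v = nn.Linear(dim_v, k)      # project visual regions
        self.p = nn.Parameter(torch.ones(k))   # low-rank pooling vector

    def forward(self, q: torch.Tensor, v: torch.Tensor):
        # q: (B, N, dim_q) question token features (e.g. from BERT)
        # v: (B, M, dim_v) visual region features (e.g. detector boxes)
        q_k = self.proj_q(q)                                        # (B, N, k)
        v_k = self.proj_v(v)                                        # (B, M, k)
        # Bilinear attention logits between every token/region pair.
        logits = torch.einsum('bnk,k,bmk->bnm', q_k, self.p, v_k)   # (B, N, M)
        # Normalize over all token-region pairs.
        attn = F.softmax(logits.reshape(logits.size(0), -1), dim=-1)
        attn = attn.reshape_as(logits)
        # Attention-weighted bilinear pooling -> fused joint feature.
        joint = torch.einsum('bnm,bnk,bmk->bk', attn, q_k, v_k)     # (B, k)
        return joint, attn

# Example: 14 question tokens (768-d) attending over 36 region features (2048-d).
ban = BilinearAttentionSketch(dim_q=768, dim_v=2048)
joint, attn = ban(torch.randn(2, 14, 768), torch.randn(2, 36, 2048))
```

Such units can be stacked, with the fused output of one unit conditioning the next, which is the sense in which the abstract describes the attention units as "cascaded in depth".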
format | Online Article Text |
id | pubmed-8838230 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-8838230 2022-02-13 Deep Modular Bilinear Attention Network for Visual Question Answering Yan, Feng; Silamu, Wushouer; Li, Yanbing Sensors (Basel) Article VQA (Visual Question Answering) is a multi-modal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, the BAN (Bilinear Attention Network) method is used to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We then sum these question features with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, the accuracy of our model reaches 70.85% on the test-std set of VQA 2.0. MDPI 2022-01-28 /pmc/articles/PMC8838230/ /pubmed/35161790 http://dx.doi.org/10.3390/s22031045 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle | Article Yan, Feng Silamu, Wushouer Li, Yanbing Deep Modular Bilinear Attention Network for Visual Question Answering |
title | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_full | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_fullStr | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_full_unstemmed | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_short | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_sort | deep modular bilinear attention network for visual question answering |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8838230/ https://www.ncbi.nlm.nih.gov/pubmed/35161790 http://dx.doi.org/10.3390/s22031045 |
work_keys_str_mv | AT yanfeng deepmodularbilinearattentionnetworkforvisualquestionanswering AT silamuwushouer deepmodularbilinearattentionnetworkforvisualquestionanswering AT liyanbing deepmodularbilinearattentionnetworkforvisualquestionanswering |