
Deep Modular Bilinear Attention Network for Visual Question Answering

VQA (Visual Question Answering) is a multimodal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate intra-modality and inter-modality attention between visual and language features. In this paper, we use the BAN (Bilinear Attention Network) method to calculate attention instead. We propose a deep multimodal bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) that construct inter-modality and intra-modality relations. These two units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We then sum these features with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0.
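To make the bilinear attention idea concrete, here is a minimal PyTorch sketch of a single low-rank bilinear attention unit in the spirit of BAN (Kim et al., 2018). It is an illustration under assumptions, not the authors' DMBA-NET code: the class name BilinearAttention, the ReLU nonlinearity, the hidden size, and the pairwise fusion step are all choices made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    """Minimal sketch of one BAN-style bilinear attention unit.
    Hypothetical illustration, not the authors' implementation."""

    def __init__(self, x_dim, y_dim, hid_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, hid_dim)  # projects the first modality (e.g., question tokens)
        self.V = nn.Linear(y_dim, hid_dim)  # projects the second modality (e.g., image regions)
        self.p = nn.Linear(hid_dim, 1)      # scores every (token, region) pair

    def forward(self, x, y):
        # x: (B, N, x_dim) question features; y: (B, M, y_dim) visual features
        x_h = torch.relu(self.U(x))                        # (B, N, h)
        y_h = torch.relu(self.V(y))                        # (B, M, h)
        # bilinear interaction for every cross-modal pair, via broadcasting
        joint = x_h.unsqueeze(2) * y_h.unsqueeze(1)        # (B, N, M, h)
        logits = self.p(joint).squeeze(-1)                 # (B, N, M)
        # normalize over all pairs to get the bilinear attention map
        attn = F.softmax(logits.flatten(1), dim=-1).view_as(logits)
        # attention-weighted sum of the joint features
        fused = torch.einsum('bnm,bnmh->bh', attn, joint)  # (B, h)
        return fused, attn

Unlike plain dot-product attention, the attention map here is scored from a joint low-rank representation of both modalities, which is the property the abstract contrasts with dot-product approaches.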


Bibliographic Details
Main Authors: Yan, Feng, Silamu, Wushouer, Li, Yanbing
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8838230/
https://www.ncbi.nlm.nih.gov/pubmed/35161790
http://dx.doi.org/10.3390/s22031045
_version_ 1784650074924515328
author Yan, Feng
Silamu, Wushouer
Li, Yanbing
author_facet Yan, Feng
Silamu, Wushouer
Li, Yanbing
author_sort Yan, Feng
collection PubMed
description VQA (Visual Question Answering) is a multimodal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate intra-modality and inter-modality attention between visual and language features. In this paper, we use the BAN (Bilinear Attention Network) method to calculate attention instead. We propose a deep multimodal bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) that construct inter-modality and intra-modality relations. These two units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We then sum these features with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0.
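As a rough illustration of how the units described above could be cascaded in depth and fused with BERT question features before classification, the following skeleton reuses the BilinearAttention sketch shown earlier in this record. The depth, feature dimensions, mean-pooling of the question, and the 3,129-way classifier (a common VQA 2.0 answer-vocabulary size) are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class DMBANetSketch(nn.Module):
    """Hypothetical skeleton of the cascaded pipeline; unit internals and
    hyperparameters are assumptions, not the published architecture."""

    def __init__(self, q_dim=768, v_dim=2048, hid=1024, depth=2, n_answers=3129):
        super().__init__()
        # BAN-GA-style units: question features attend over image regions
        self.inter = nn.ModuleList(
            [BilinearAttention(q_dim, v_dim, hid) for _ in range(depth)])
        # BAN-SA-style units: question features attend over themselves
        self.intra = nn.ModuleList(
            [BilinearAttention(q_dim, q_dim, hid) for _ in range(depth)])
        self.q_proj = nn.Linear(q_dim, hid)  # maps question features to the fused space
        self.classifier = nn.Linear(hid, n_answers)

    def forward(self, q_tokens, v_regions):
        # q_tokens: (B, N, q_dim) BERT token features, already self-attended
        # v_regions: (B, M, v_dim) region features (e.g., from an object detector)
        fused = 0
        for ga, sa in zip(self.inter, self.intra):
            inter_feat, _ = ga(q_tokens, v_regions)  # inter-modality relations
            intra_feat, _ = sa(q_tokens, q_tokens)   # intra-modality relations
            fused = fused + inter_feat + intra_feat
        # sum with the (mean-pooled) question features before classification,
        # as the description states
        q_feat = self.q_proj(q_tokens.mean(dim=1))
        return self.classifier(fused + q_feat)

# Usage: logits over the answer vocabulary for a batch of 4 image/question pairs
# model = DMBANetSketch()
# logits = model(torch.randn(4, 14, 768), torch.randn(4, 36, 2048))  # (4, 3129)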
format Online
Article
Text
id pubmed-8838230
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8838230 2022-02-13 Deep Modular Bilinear Attention Network for Visual Question Answering Yan, Feng Silamu, Wushouer Li, Yanbing Sensors (Basel) Article VQA (Visual Question Answering) is a multimodal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate intra-modality and inter-modality attention between visual and language features. In this paper, we use the BAN (Bilinear Attention Network) method to calculate attention instead. We propose a deep multimodal bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) that construct inter-modality and intra-modality relations. These two units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We then sum these features with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches an accuracy of 70.85% on the test-std split of VQA 2.0. MDPI 2022-01-28 /pmc/articles/PMC8838230/ /pubmed/35161790 http://dx.doi.org/10.3390/s22031045 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Yan, Feng
Silamu, Wushouer
Li, Yanbing
Deep Modular Bilinear Attention Network for Visual Question Answering
title Deep Modular Bilinear Attention Network for Visual Question Answering
title_full Deep Modular Bilinear Attention Network for Visual Question Answering
title_fullStr Deep Modular Bilinear Attention Network for Visual Question Answering
title_full_unstemmed Deep Modular Bilinear Attention Network for Visual Question Answering
title_short Deep Modular Bilinear Attention Network for Visual Question Answering
title_sort deep modular bilinear attention network for visual question answering
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8838230/
https://www.ncbi.nlm.nih.gov/pubmed/35161790
http://dx.doi.org/10.3390/s22031045
work_keys_str_mv AT yanfeng deepmodularbilinearattentionnetworkforvisualquestionanswering
AT silamuwushouer deepmodularbilinearattentionnetworkforvisualquestionanswering
AT liyanbing deepmodularbilinearattentionnetworkforvisualquestionanswering