Cargando…

Deep Modular Bilinear Attention Network for Visual Question Answering

VQA (Visual Question Answering) is a multi-model task. Given a picture and a question related to the image, it will determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product to calculate the intra-modali...

Descripción completa

Detalles Bibliográficos
Autores principales: Yan, Feng, Silamu, Wushouer, Li, Yanbing
Formato: Online Artículo Texto
Lenguaje:English
Publicado: MDPI 2022
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8838230/
https://www.ncbi.nlm.nih.gov/pubmed/35161790
http://dx.doi.org/10.3390/s22031045
Descripción
Sumario:VQA (Visual Question Answering) is a multi-model task. Given a picture and a question related to the image, it will determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, the BAN (Bilinear Attention Network) method was used to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question based on the dynamic word vector of BERT(Bidirectional Encoder Representations from Transformers), then use self-attention to process the question features further. Then we sum them with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome datasets for augmentation, the accuracy of our model reaches 70.85% on the test-std dataset of VQA 2.0.