Deep Modular Bilinear Attention Network for Visual Question Answering
VQA (Visual Question Answering) is a multi-modal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate the intra-modali...
| Main Authors: | Yan, Feng; Silamu, Wushouer; Li, Yanbing |
|---|---|
| Format: | Online Article Text |
| Language: | English |
| Published: | MDPI, 2022 |
| Subjects: | Article |
| Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8838230/ https://www.ncbi.nlm.nih.gov/pubmed/35161790 http://dx.doi.org/10.3390/s22031045 |
_version_ | 1784650074924515328 |
author | Yan, Feng; Silamu, Wushouer; Li, Yanbing
author_facet | Yan, Feng; Silamu, Wushouer; Li, Yanbing
author_sort | Yan, Feng |
collection | PubMed |
description | VQA (Visual Question Answering) is a multi-modal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, the BAN (Bilinear Attention Network) method is used to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We then sum these question features with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, the accuracy of our model reaches 70.85% on the test-std set of VQA 2.0.
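The abstract presents bilinear attention units (BAN-GA and BAN-SA) as the core of DMBA-NET. For readers unfamiliar with bilinear attention, the following is a minimal sketch of a single-glimpse bilinear attention unit in the general style of BAN; the class name, hidden size, and wiring are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttentionSketch(nn.Module):
    """Minimal single-glimpse bilinear attention unit in the spirit of BAN.

    The hidden size `k` and the overall wiring are illustrative assumptions,
    not the exact configuration used in DMBA-NET.
    """
    def __init__(self, dim_q: int, dim_v: int, k: int = 512):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, k)      # project question tokens
        self.proj_v = nn.Linear(dim_v, k)      # project visual regions
        self.p = nn.Parameter(torch.ones(k))   # low-rank pooling vector

    def forward(self, q: torch.Tensor, v: torch.Tensor):
        # q: (B, N, dim_q) question token features (e.g. from BERT)
        # v: (B, M, dim_v) visual region features (e.g. detector boxes)
        q_k = self.proj_q(q)                                        # (B, N, k)
        v_k = self.proj_v(v)                                        # (B, M, k)
        # Bilinear attention logits between every token/region pair.
        logits = torch.einsum('bnk,k,bmk->bnm', q_k, self.p, v_k)   # (B, N, M)
        # Normalize over all token-region pairs.
        attn = F.softmax(logits.reshape(logits.size(0), -1), dim=-1)
        attn = attn.reshape_as(logits)
        # Attention-weighted bilinear pooling -> fused joint feature.
        joint = torch.einsum('bnm,bnk,bmk->bk', attn, q_k, v_k)     # (B, k)
        return joint, attn

# Example: 14 question tokens (768-d) attending over 36 region features (2048-d).
ban = BilinearAttentionSketch(dim_q=768, dim_v=2048)
joint, attn = ban(torch.randn(2, 14, 768), torch.randn(2, 36, 2048))
```

Such units can be stacked, with the fused output of one unit conditioning the next, which is the sense in which the abstract describes the attention units as "cascaded in depth".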
format | Online Article Text |
id | pubmed-8838230 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-8838230 2022-02-13 Deep Modular Bilinear Attention Network for Visual Question Answering Yan, Feng; Silamu, Wushouer; Li, Yanbing Sensors (Basel) Article VQA (Visual Question Answering) is a multi-modal task: given an image and a question about that image, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use the dot product to calculate the intra-modality and inter-modality attention between visual and language features. In this paper, the BAN (Bilinear Attention Network) method is used to calculate attention. We propose a deep multimodality bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) to construct inter-modality and intra-modality relations. The two basic attention units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We then sum these question features with the features obtained by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, the accuracy of our model reaches 70.85% on the test-std set of VQA 2.0. MDPI 2022-01-28 /pmc/articles/PMC8838230/ /pubmed/35161790 http://dx.doi.org/10.3390/s22031045 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle | Article Yan, Feng Silamu, Wushouer Li, Yanbing Deep Modular Bilinear Attention Network for Visual Question Answering |
title | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_full | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_fullStr | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_full_unstemmed | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_short | Deep Modular Bilinear Attention Network for Visual Question Answering |
title_sort | deep modular bilinear attention network for visual question answering |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8838230/ https://www.ncbi.nlm.nih.gov/pubmed/35161790 http://dx.doi.org/10.3390/s22031045 |
work_keys_str_mv | AT yanfeng deepmodularbilinearattentionnetworkforvisualquestionanswering AT silamuwushouer deepmodularbilinearattentionnetworkforvisualquestionanswering AT liyanbing deepmodularbilinearattentionnetworkforvisualquestionanswering |