Modality attention fusion model with hybrid multi-head self-attention for video understanding

Video question answering (Video-QA) is a subject of intense study in Artificial Intelligence, and it is one of the tasks that can be used to evaluate an AI system's abilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on answering multiple-choice questions about a video-subtitle-QA representation by fusing attention and self-attention between the modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, providing a useful input representation for our model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework for the attention fusion matrix built from the different modalities (video, subtitles, QA), and use a Hybrid Multi-head Self-attention (HMS) module to further determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the various components of the network and to demonstrate, through experiments on question types and required modalities, the effectiveness and advantages of our method over existing methods.
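The abstract outlines a concrete pipeline: per-modality feature extraction (BERT for text, Faster R-CNN for video), attention-based fusion across modalities, and multi-head self-attention to score candidate answers. The sketch below illustrates that general shape in PyTorch; the class name, projection dimensions, and the exact fusion rule are assumptions made for illustration, not the authors' MAF-HMS implementation.

# Hypothetical sketch of the modality-fusion idea described in the abstract:
# project per-modality features (video, subtitles, QA) into a shared space,
# let QA tokens attend to the video and subtitle tokens (cross-modal fusion),
# then refine with multi-head self-attention and score the candidate answer.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityFusionSketch(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # One linear projection per modality into a shared feature space.
        self.proj_video = nn.Linear(2048, dim)  # e.g. Faster R-CNN region features
        self.proj_text = nn.Linear(768, dim)    # e.g. BERT token features
        # Cross-modal attention: QA tokens attend to video + subtitle tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention over the fused representation.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)          # score for one candidate answer

    def forward(self, video, subtitle, qa):
        v = self.proj_video(video)              # (B, Nv, dim)
        s = self.proj_text(subtitle)            # (B, Ns, dim)
        q = self.proj_text(qa)                  # (B, Nq, dim)
        context = torch.cat([v, s], dim=1)      # (B, Nv+Ns, dim)
        fused, _ = self.cross_attn(q, context, context)  # QA attends to video+subs
        fused, _ = self.self_attn(fused, fused, fused)   # refine via self-attention
        return self.score(fused.mean(dim=1))    # (B, 1)

# Toy usage: score one candidate answer for a batch of 2 clips.
model = ModalityFusionSketch()
video = torch.randn(2, 20, 2048)    # 20 region features per clip
subtitle = torch.randn(2, 30, 768)  # 30 subtitle tokens
qa = torch.randn(2, 15, 768)        # question + candidate answer tokens
print(model(video, subtitle, qa).shape)  # torch.Size([2, 1])

In a multiple-choice setting, a sketch like this would be run once per candidate answer and the scores compared (e.g. via softmax) to pick the predicted answer.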


Bibliographic Details
Main Authors: Zhuang, Xuqiang, Liu, Fang’ai, Hou, Jian, Hao, Jianhua, Cai, Xiaohong
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9536548/
https://www.ncbi.nlm.nih.gov/pubmed/36201513
http://dx.doi.org/10.1371/journal.pone.0275156
_version_ 1784803003279081472
author Zhuang, Xuqiang
Liu, Fang’ai
Hou, Jian
Hao, Jianhua
Cai, Xiaohong
author_facet Zhuang, Xuqiang
Liu, Fang’ai
Hou, Jian
Hao, Jianhua
Cai, Xiaohong
author_sort Zhuang, Xuqiang
collection PubMed
description Video question answering (Video-QA) is a subject of intense study in Artificial Intelligence, and it is one of the tasks that can be used to evaluate an AI system's abilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on answering multiple-choice questions about a video-subtitle-QA representation by fusing attention and self-attention between the modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, providing a useful input representation for our model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework for the attention fusion matrix built from the different modalities (video, subtitles, QA), and use a Hybrid Multi-head Self-attention (HMS) module to further determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the various components of the network and to demonstrate, through experiments on question types and required modalities, the effectiveness and advantages of our method over existing methods.
format Online
Article
Text
id pubmed-9536548
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9536548 2022-10-07 Modality attention fusion model with hybrid multi-head self-attention for video understanding Zhuang, Xuqiang Liu, Fang’ai Hou, Jian Hao, Jianhua Cai, Xiaohong PLoS One Research Article Video question answering (Video-QA) is a subject of intense study in Artificial Intelligence, and it is one of the tasks that can be used to evaluate an AI system's abilities. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on answering multiple-choice questions about a video-subtitle-QA representation by fusing attention and self-attention between the modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, providing a useful input representation for our model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework for the attention fusion matrix built from the different modalities (video, subtitles, QA), and use a Hybrid Multi-head Self-attention (HMS) module to further determine the correct answer. Experiments on three separate scene datasets show that our overall model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the various components of the network and to demonstrate, through experiments on question types and required modalities, the effectiveness and advantages of our method over existing methods. Public Library of Science 2022-10-06 /pmc/articles/PMC9536548/ /pubmed/36201513 http://dx.doi.org/10.1371/journal.pone.0275156 Text en © 2022 Zhuang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Zhuang, Xuqiang
Liu, Fang’ai
Hou, Jian
Hao, Jianhua
Cai, Xiaohong
Modality attention fusion model with hybrid multi-head self-attention for video understanding
title Modality attention fusion model with hybrid multi-head self-attention for video understanding
title_full Modality attention fusion model with hybrid multi-head self-attention for video understanding
title_fullStr Modality attention fusion model with hybrid multi-head self-attention for video understanding
title_full_unstemmed Modality attention fusion model with hybrid multi-head self-attention for video understanding
title_short Modality attention fusion model with hybrid multi-head self-attention for video understanding
title_sort modality attention fusion model with hybrid multi-head self-attention for video understanding
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9536548/
https://www.ncbi.nlm.nih.gov/pubmed/36201513
http://dx.doi.org/10.1371/journal.pone.0275156
work_keys_str_mv AT zhuangxuqiang modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding
AT liufangai modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding
AT houjian modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding
AT haojianhua modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding
AT caixiaohong modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding