Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition


Bibliographic Details
Main Authors: Liu, Xiaodong, Li, Songyang, Wang, Miao
Format: Online Article Text
Language: English
Published: Hindawi 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8487826/
https://www.ncbi.nlm.nih.gov/pubmed/34616444
http://dx.doi.org/10.1155/2021/5585041
_version_ 1784578035124535296
author Liu, Xiaodong
Li, Songyang
Wang, Miao
author_facet Liu, Xiaodong
Li, Songyang
Wang, Miao
author_sort Liu, Xiaodong
collection PubMed
description Context, such as scenes and objects, plays an important role in video emotion recognition, and recognition accuracy improves when context information is incorporated. Although previous research has considered context information, it often ignores the fact that different images may carry different emotional cues. To address these differences in emotional content across modalities and across images, this paper proposes a hierarchical attention-based multimodal fusion network for video emotion recognition, consisting of a multimodal feature extraction module and a multimodal feature fusion module. The multimodal feature extraction module has three subnetworks that extract features from facial, scene, and global images. Each subnetwork consists of two branches: the first extracts features for its modality, and the second generates an emotion score for each image. The features and emotion scores of all images within a modality are aggregated to produce that modality's emotion feature. The fusion module takes the modality-level features as input and generates an emotion score for each modality. Finally, the features and emotion scores of all modalities are aggregated to produce the final emotion representation of the video. Experimental results show that the proposed method is effective on a video emotion recognition dataset. (A minimal code sketch of this two-level attention scheme is given after the record below.)
format Online
Article
Text
id pubmed-8487826
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Hindawi
record_format MEDLINE/PubMed
spelling pubmed-8487826 2021-10-05 Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition. Liu, Xiaodong; Li, Songyang; Wang, Miao. Comput Intell Neurosci, Research Article. Hindawi, 2021-09-25. /pmc/articles/PMC8487826/ /pubmed/34616444 http://dx.doi.org/10.1155/2021/5585041. Text en. Copyright © 2021 Xiaodong Liu et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Liu, Xiaodong
Li, Songyang
Wang, Miao
Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
title Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
title_full Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
title_fullStr Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
title_full_unstemmed Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
title_short Hierarchical Attention-Based Multimodal Fusion Network for Video Emotion Recognition
title_sort hierarchical attention-based multimodal fusion network for video emotion recognition
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8487826/
https://www.ncbi.nlm.nih.gov/pubmed/34616444
http://dx.doi.org/10.1155/2021/5585041
work_keys_str_mv AT liuxiaodong hierarchicalattentionbasedmultimodalfusionnetworkforvideoemotionrecognition
AT lisongyang hierarchicalattentionbasedmultimodalfusionnetworkforvideoemotionrecognition
AT wangmiao hierarchicalattentionbasedmultimodalfusionnetworkforvideoemotionrecognition
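The description field above outlines a two-level attention scheme: per-image emotion scores weight image features within each modality, and per-modality scores then weight the three modality features. The following PyTorch sketch illustrates that idea only; the class names (AttentiveAggregator, HierarchicalFusionNet), the feature dimension, the number of emotion classes, and the softmax-normalized scoring branch are all assumptions for illustration, not the authors' published implementation.

import torch
import torch.nn as nn

class AttentiveAggregator(nn.Module):
    # Scores each item (image or modality) and returns the score-weighted
    # sum of the item features; this plays the role of the abstract's
    # "emotion score" branch. Softmax normalization is an assumption.
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, feats):                     # feats: (num_items, feat_dim)
        weights = torch.softmax(self.score(feats), dim=0)   # (num_items, 1)
        return (weights * feats).sum(dim=0)                 # (feat_dim,)

class HierarchicalFusionNet(nn.Module):
    # Level 1: aggregate per-image features within each of the three
    # modalities (facial, scene, global). Level 2: aggregate the three
    # modality-level features into one video-level emotion representation.
    def __init__(self, feat_dim=512, num_classes=8):        # sizes assumed
        super().__init__()
        self.intra = nn.ModuleDict(
            {m: AttentiveAggregator(feat_dim) for m in ("face", "scene", "global")})
        self.inter = AttentiveAggregator(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, modal_feats):
        # modal_feats: dict mapping modality name -> (num_images, feat_dim)
        # per-image features, e.g. from a pretrained CNN backbone.
        modal_vecs = torch.stack(
            [self.intra[m](modal_feats[m]) for m in ("face", "scene", "global")])
        video_vec = self.inter(modal_vecs)        # fuse modality features
        return self.classifier(video_vec)         # emotion logits

# Example: 16 sampled frames per modality for a single video.
feats = {m: torch.randn(16, 512) for m in ("face", "scene", "global")}
logits = HierarchicalFusionNet()(feats)           # shape: (num_classes,)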