Multimodal interaction enhanced representation learning for video emotion recognition
Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches are centered around designing sophisticated fusion mechanisms, but usually ignore the fact that text contains global semantic information, while speech and face video show...
Main Authors: | Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A., 2022 |
Subjects: | Neuroscience |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9806211/ https://www.ncbi.nlm.nih.gov/pubmed/36601594 http://dx.doi.org/10.3389/fnins.2022.1086380 |
_version_ | 1784862485001535488 |
---|---|
author | Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei
author_facet | Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei
author_sort | Xia, Xiaohan |
collection | PubMed |
description | Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches are centered around designing sophisticated fusion mechanisms, but usually ignore the fact that text contains global semantic information, while speech and face video show more fine-grained temporal dynamics of emotion. From the perspective of cognitive sciences, the process of emotion expression, either through facial expression or speech, is implicitly regulated by high-level semantics. Inspired by this fact, we propose a multimodal interaction enhanced representation learning framework for emotion recognition from face video, where a semantic enhancement module is first designed to guide the audio/visual encoder using the semantic information from text, then the multimodal bottleneck Transformer is adopted to further reinforce the audio and visual representations by modeling the cross-modal dynamic interactions between the two feature sequences. Experimental results on two benchmark emotion databases indicate the superiority of our proposed method. With the semantic enhanced audio and visual features, it outperforms the state-of-the-art models which fuse the features or decisions from the audio, visual and text modalities. |
format | Online Article Text |
id | pubmed-9806211 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9806211 2023-01-03 Multimodal interaction enhanced representation learning for video emotion recognition Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei Front Neurosci Neuroscience Frontiers Media S.A. 2022-12-19 /pmc/articles/PMC9806211/ /pubmed/36601594 http://dx.doi.org/10.3389/fnins.2022.1086380 Text en Copyright © 2022 Xia, Zhao and Jiang. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Neuroscience; Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei; Multimodal interaction enhanced representation learning for video emotion recognition |
title | Multimodal interaction enhanced representation learning for video emotion recognition |
title_full | Multimodal interaction enhanced representation learning for video emotion recognition |
title_fullStr | Multimodal interaction enhanced representation learning for video emotion recognition |
title_full_unstemmed | Multimodal interaction enhanced representation learning for video emotion recognition |
title_short | Multimodal interaction enhanced representation learning for video emotion recognition |
title_sort | multimodal interaction enhanced representation learning for video emotion recognition |
topic | Neuroscience |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9806211/ https://www.ncbi.nlm.nih.gov/pubmed/36601594 http://dx.doi.org/10.3389/fnins.2022.1086380 |
work_keys_str_mv | AT xiaxiaohan multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition AT zhaoyong multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition AT jiangdongmei multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition |
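The description field above mentions a multimodal bottleneck Transformer that reinforces the audio and visual representations by modeling cross-modal interactions between the two feature sequences. As a rough illustration only, and not the authors' released code, the PyTorch sketch below shows how a small set of learned bottleneck tokens can mediate attention between an audio and a visual feature stream; the module name, dimensions, and the sequential audio-then-visual update are assumptions made for the example.

```python
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    """Illustrative sketch: a few learned bottleneck tokens carry information
    between an audio token sequence and a visual token sequence."""

    def __init__(self, dim: int = 256, n_heads: int = 4, n_bottleneck: int = 4):
        super().__init__()
        # Shared bottleneck tokens through which all cross-modal exchange flows.
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.audio_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.visual_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, Ta, dim), visual: (batch, Tv, dim)
        bott = self.bottleneck.expand(audio.size(0), -1, -1)
        # Audio tokens attend over themselves plus the bottleneck tokens.
        a = self.audio_layer(torch.cat([audio, bott], dim=1))
        audio_out, bott_a = a[:, : audio.size(1)], a[:, audio.size(1):]
        # Visual tokens then attend over themselves plus the audio-updated
        # bottleneck, so cross-modal information passes only through it.
        v = self.visual_layer(torch.cat([visual, bott_a], dim=1))
        visual_out = v[:, : visual.size(1)]
        return audio_out, visual_out


# Toy usage with random features standing in for encoder outputs.
audio_feats = torch.randn(2, 100, 256)   # e.g. frame-level acoustic features
visual_feats = torch.randn(2, 50, 256)   # e.g. face-video features
a_out, v_out = BottleneckFusionLayer()(audio_feats, visual_feats)
print(a_out.shape, v_out.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 50, 256])
```

In a setup like this, the fused audio and visual sequences would typically be pooled and passed to an emotion classifier; restricting cross-modal attention to a handful of bottleneck tokens keeps the exchange cheap compared with full pairwise attention between the two sequences.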