Multimodal interaction enhanced representation learning for video emotion recognition

Bibliographic Details
Main Authors: Xia, Xiaohan, Zhao, Yong, Jiang, Dongmei
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9806211/
https://www.ncbi.nlm.nih.gov/pubmed/36601594
http://dx.doi.org/10.3389/fnins.2022.1086380
_version_ 1784862485001535488
author Xia, Xiaohan
Zhao, Yong
Jiang, Dongmei
author_facet Xia, Xiaohan
Zhao, Yong
Jiang, Dongmei
author_sort Xia, Xiaohan
collection PubMed
description Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches center on designing sophisticated fusion mechanisms but usually ignore the fact that text carries global semantic information, while speech and face video exhibit more fine-grained temporal dynamics of emotion. From the perspective of cognitive science, the process of emotion expression, whether through facial expression or speech, is implicitly regulated by high-level semantics. Inspired by this, we propose a multimodal interaction enhanced representation learning framework for emotion recognition from face video: a semantic enhancement module is first designed to guide the audio/visual encoders using the semantic information from text, and a multimodal bottleneck Transformer is then adopted to further reinforce the audio and visual representations by modeling the dynamic cross-modal interactions between the two feature sequences. Experimental results on two benchmark emotion databases demonstrate the superiority of the proposed method: with the semantically enhanced audio and visual features, it outperforms state-of-the-art models that fuse features or decisions from the audio, visual, and text modalities.
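
To make the fusion idea in the abstract concrete, below is a minimal sketch, assuming a PyTorch setting, of how bottleneck-token fusion between the audio and visual feature sequences could look. The class name BottleneckFusionLayer, the feature dimensions, and the averaging of the bottleneck updates are illustrative assumptions, not the authors' implementation.

# Minimal, illustrative PyTorch sketch of bottleneck-token fusion between an
# audio and a visual feature sequence, in the spirit of the multimodal
# bottleneck Transformer mentioned in the abstract. Names and sizes below are
# assumptions for illustration only.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer: the two modalities exchange information only through
    a small set of shared, learnable bottleneck tokens."""

    def __init__(self, dim=256, num_heads=4, num_bottleneck=4):
        super().__init__()
        # Learnable bottleneck tokens shared by both modalities.
        self.bottleneck = nn.Parameter(0.02 * torch.randn(1, num_bottleneck, dim))
        self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.visual_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, audio, visual):
        # audio: (batch, T_a, dim), visual: (batch, T_v, dim)
        btl = self.bottleneck.expand(audio.size(0), -1, -1)
        n = btl.size(1)
        # Each stream self-attends over [its own tokens ; bottleneck tokens],
        # so cross-modal information flows only through the bottleneck.
        a_out = self.audio_layer(torch.cat([audio, btl], dim=1))
        v_out = self.visual_layer(torch.cat([visual, btl], dim=1))
        audio_out, btl_a = a_out[:, :-n], a_out[:, -n:]
        visual_out, btl_v = v_out[:, :-n], v_out[:, -n:]
        # Average the two bottleneck updates (one plausible merging choice).
        return audio_out, visual_out, 0.5 * (btl_a + btl_v)

# Example usage with dummy features (shapes are illustrative):
layer = BottleneckFusionLayer()
audio_feats = torch.randn(2, 100, 256)   # frame-level acoustic features
visual_feats = torch.randn(2, 50, 256)   # face-video features
a, v, b = layer(audio_feats, visual_feats)

Restricting cross-modal attention to a handful of bottleneck tokens keeps the interaction cost low relative to full pairwise cross-attention, which is the usual motivation for this style of fusion.
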
format Online
Article
Text
id pubmed-9806211
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-98062112023-01-03 Multimodal interaction enhanced representation learning for video emotion recognition Xia, Xiaohan Zhao, Yong Jiang, Dongmei Front Neurosci Neuroscience Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches are centered around designing sophisticated fusion mechanisms, but usually ignore the fact that text contains global semantic information, while speech and face video show more fine-grained temporal dynamics of emotion. From the perspective of cognitive sciences, the process of emotion expression, either through facial expression or speech, is implicitly regulated by high-level semantics. Inspired by this fact, we propose a multimodal interaction enhanced representation learning framework for emotion recognition from face video, where a semantic enhancement module is first designed to guide the audio/visual encoder using the semantic information from text, then the multimodal bottleneck Transformer is adopted to further reinforce the audio and visual representations by modeling the cross-modal dynamic interactions between the two feature sequences. Experimental results on two benchmark emotion databases indicate the superiority of our proposed method. With the semantic enhanced audio and visual features, it outperforms the state-of-the-art models which fuse the features or decisions from the audio, visual and text modalities. Frontiers Media S.A. 2022-12-19 /pmc/articles/PMC9806211/ /pubmed/36601594 http://dx.doi.org/10.3389/fnins.2022.1086380 Text en Copyright © 2022 Xia, Zhao and Jiang. https://creativecommons.org/licenses/by/4.0/This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Neuroscience
Xia, Xiaohan
Zhao, Yong
Jiang, Dongmei
Multimodal interaction enhanced representation learning for video emotion recognition
title Multimodal interaction enhanced representation learning for video emotion recognition
title_full Multimodal interaction enhanced representation learning for video emotion recognition
title_fullStr Multimodal interaction enhanced representation learning for video emotion recognition
title_full_unstemmed Multimodal interaction enhanced representation learning for video emotion recognition
title_short Multimodal interaction enhanced representation learning for video emotion recognition
title_sort multimodal interaction enhanced representation learning for video emotion recognition
topic Neuroscience
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9806211/
https://www.ncbi.nlm.nih.gov/pubmed/36601594
http://dx.doi.org/10.3389/fnins.2022.1086380
work_keys_str_mv AT xiaxiaohan multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition
AT zhaoyong multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition
AT jiangdongmei multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition