Multimodal interaction enhanced representation learning for video emotion recognition
Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches are centered around designing sophisticated fusion mechanisms, but usually ignore the fact that text contains global semantic information, while speech and face video show...
Main Authors: | Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A., 2022 |
Subjects: | Neuroscience |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9806211/ https://www.ncbi.nlm.nih.gov/pubmed/36601594 http://dx.doi.org/10.3389/fnins.2022.1086380 |
_version_ | 1784862485001535488 |
---|---|
author | Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei
author_facet | Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei
author_sort | Xia, Xiaohan |
collection | PubMed |
description | Video emotion recognition aims to infer human emotional states from the audio, visual, and text modalities. Previous approaches are centered around designing sophisticated fusion mechanisms, but usually ignore the fact that text contains global semantic information, while speech and face video show more fine-grained temporal dynamics of emotion. From the perspective of cognitive sciences, the process of emotion expression, either through facial expression or speech, is implicitly regulated by high-level semantics. Inspired by this fact, we propose a multimodal interaction enhanced representation learning framework for emotion recognition from face video, where a semantic enhancement module is first designed to guide the audio/visual encoder using the semantic information from text, then the multimodal bottleneck Transformer is adopted to further reinforce the audio and visual representations by modeling the cross-modal dynamic interactions between the two feature sequences. Experimental results on two benchmark emotion databases indicate the superiority of our proposed method. With the semantic enhanced audio and visual features, it outperforms the state-of-the-art models which fuse the features or decisions from the audio, visual and text modalities. |
format | Online Article Text |
id | pubmed-9806211 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9806211 2023-01-03 Multimodal interaction enhanced representation learning for video emotion recognition Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei Front Neurosci Neuroscience Frontiers Media S.A. 2022-12-19 /pmc/articles/PMC9806211/ /pubmed/36601594 http://dx.doi.org/10.3389/fnins.2022.1086380 Text en Copyright © 2022 Xia, Zhao and Jiang. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
spellingShingle | Neuroscience; Xia, Xiaohan; Zhao, Yong; Jiang, Dongmei; Multimodal interaction enhanced representation learning for video emotion recognition |
title | Multimodal interaction enhanced representation learning for video emotion recognition |
title_full | Multimodal interaction enhanced representation learning for video emotion recognition |
title_fullStr | Multimodal interaction enhanced representation learning for video emotion recognition |
title_full_unstemmed | Multimodal interaction enhanced representation learning for video emotion recognition |
title_short | Multimodal interaction enhanced representation learning for video emotion recognition |
title_sort | multimodal interaction enhanced representation learning for video emotion recognition |
topic | Neuroscience |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9806211/ https://www.ncbi.nlm.nih.gov/pubmed/36601594 http://dx.doi.org/10.3389/fnins.2022.1086380 |
work_keys_str_mv | AT xiaxiaohan multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition AT zhaoyong multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition AT jiangdongmei multimodalinteractionenhancedrepresentationlearningforvideoemotionrecognition |
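The description field above mentions a multimodal bottleneck Transformer that reinforces the audio and visual representations by modeling cross-modal interactions between the two feature sequences. As a rough illustration only, and not the authors' released code, the PyTorch sketch below shows how a small set of learned bottleneck tokens can mediate attention between an audio and a visual feature stream; the module name, dimensions, and the sequential audio-then-visual update are assumptions made for the example.

```python
import torch
import torch.nn as nn


class BottleneckFusionLayer(nn.Module):
    """Illustrative sketch: a few learned bottleneck tokens carry information
    between an audio token sequence and a visual token sequence."""

    def __init__(self, dim: int = 256, n_heads: int = 4, n_bottleneck: int = 4):
        super().__init__()
        # Shared bottleneck tokens through which all cross-modal exchange flows.
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.audio_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.visual_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (batch, Ta, dim), visual: (batch, Tv, dim)
        bott = self.bottleneck.expand(audio.size(0), -1, -1)
        # Audio tokens attend over themselves plus the bottleneck tokens.
        a = self.audio_layer(torch.cat([audio, bott], dim=1))
        audio_out, bott_a = a[:, : audio.size(1)], a[:, audio.size(1):]
        # Visual tokens then attend over themselves plus the audio-updated
        # bottleneck, so cross-modal information passes only through it.
        v = self.visual_layer(torch.cat([visual, bott_a], dim=1))
        visual_out = v[:, : visual.size(1)]
        return audio_out, visual_out


# Toy usage with random features standing in for encoder outputs.
audio_feats = torch.randn(2, 100, 256)   # e.g. frame-level acoustic features
visual_feats = torch.randn(2, 50, 256)   # e.g. face-video features
a_out, v_out = BottleneckFusionLayer()(audio_feats, visual_feats)
print(a_out.shape, v_out.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 50, 256])
```

In a setup like this, the fused audio and visual sequences would typically be pooled and passed to an emotion classifier; restricting cross-modal attention to a handful of bottleneck tokens keeps the exchange cheap compared with full pairwise attention between the two sequences.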