A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face

Multimodal emotion recognition (MER) refers to the identification and understanding of human emotional states by combining different signals, including—but not limited to—text, speech, and face cues. MER plays a crucial role in the human–computer interaction (HCI) domain. With the recent progression of deep learning technologies and the increasing availability of multimodal datasets, the MER domain has witnessed considerable development, resulting in numerous significant research breakthroughs. However, a conspicuous absence of thorough and focused reviews on these deep learning-based MER achievements is observed. This survey aims to bridge this gap by providing a comprehensive overview of the recent advancements in MER based on deep learning. For an orderly exposition, this paper first outlines a meticulous analysis of the current multimodal datasets, emphasizing their advantages and constraints. Subsequently, we thoroughly scrutinize diverse methods for multimodal emotional feature extraction, highlighting the merits and demerits of each method. Moreover, we perform an exhaustive analysis of various MER algorithms, with particular focus on the model-agnostic fusion methods (including early fusion, late fusion, and hybrid fusion) and fusion based on intermediate layers of deep models (encompassing simple concatenation fusion, utterance-level interaction fusion, and fine-grained interaction fusion). We assess the strengths and weaknesses of these fusion strategies, providing guidance to researchers to help them select the most suitable techniques for their studies. In summary, this survey aims to provide a thorough and insightful review of the field of deep learning-based MER. It is intended as a valuable guide to aid researchers in furthering the evolution of this dynamic and impactful field.
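
To make the abstract's fusion taxonomy concrete, the following short PyTorch sketch (illustrative only, not code from the survey; the feature dimensions, class names, and six-way label space are assumptions) contrasts early fusion, which concatenates per-modality features before a shared classifier, with late fusion, which averages per-modality decisions.

```python
# Hypothetical sketch of two model-agnostic fusion strategies named in the
# abstract. Dimensions are illustrative assumptions (e.g., 768 for a
# text-encoder embedding); nothing here is taken from the survey itself.
import torch
import torch.nn as nn

NUM_EMOTIONS = 6  # assumed label space, e.g., six basic emotion categories

class EarlyFusion(nn.Module):
    """Feature-level fusion: concatenate speech/text/face features,
    then classify them jointly."""
    def __init__(self, d_speech=128, d_text=768, d_face=256):
        super().__init__()
        self.classifier = nn.Linear(d_speech + d_text + d_face, NUM_EMOTIONS)

    def forward(self, speech, text, face):
        fused = torch.cat([speech, text, face], dim=-1)  # early fusion
        return self.classifier(fused)

class LateFusion(nn.Module):
    """Decision-level fusion: classify each modality separately,
    then average the per-modality logits."""
    def __init__(self, d_speech=128, d_text=768, d_face=256):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d, NUM_EMOTIONS) for d in (d_speech, d_text, d_face)]
        )

    def forward(self, speech, text, face):
        logits = [h(x) for h, x in zip(self.heads, (speech, text, face))]
        return torch.stack(logits).mean(dim=0)  # late fusion

# Usage with random stand-in features for a batch of 4 utterances:
speech, text, face = torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 256)
print(EarlyFusion()(speech, text, face).shape)  # torch.Size([4, 6])
print(LateFusion()(speech, text, face).shape)   # torch.Size([4, 6])
```

Hybrid fusion, also covered by the survey, would combine both levels, while the intermediate-layer strategies the abstract lists (utterance-level and fine-grained interaction fusion) replace the simple concatenation above with learned cross-modal interaction modules.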

Bibliographic Details
Main Authors: Lian, Hailun; Lu, Cheng; Li, Sunan; Zhao, Yan; Tang, Chuangao; Zong, Yuan
Format: Online Article Text
Language: English
Published: MDPI, 2023-10-12
Journal: Entropy (Basel)
Subjects: Review
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10606253/
https://www.ncbi.nlm.nih.gov/pubmed/37895561
http://dx.doi.org/10.3390/e25101440
License: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).