
LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition

Bibliographic Details
Main Authors: Liu, Feng, Shen, Si-Yuan, Fu, Zi-Wang, Wang, Han-Yang, Zhou, Ai-Min, Qi, Jia-Yin
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9316084/
https://www.ncbi.nlm.nih.gov/pubmed/35885233
http://dx.doi.org/10.3390/e24071010
_version_ 1784754718867718144
author Liu, Feng
Shen, Si-Yuan
Fu, Zi-Wang
Wang, Han-Yang
Zhou, Ai-Min
Qi, Jia-Yin
author_facet Liu, Feng
Shen, Si-Yuan
Fu, Zi-Wang
Wang, Han-Yang
Zhou, Ai-Min
Qi, Jia-Yin
author_sort Liu, Feng
collection PubMed
description Semantic-rich speech emotion recognition has a high degree of popularity in a range of areas. Speech emotion recognition aims to recognize human emotional states from utterances containing both acoustic and linguistic information. Since both textual and audio patterns play essential roles in speech emotion recognition (SER) tasks, various works have proposed novel modality fusing methods to exploit text and audio signals effectively. However, most of the high performance of existing models is dependent on a great number of learnable parameters, and they can only work well on data with fixed length. Therefore, minimizing computational overhead and improving generalization to unseen data with various lengths while maintaining a certain level of recognition accuracy is an urgent application problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model is capable of fusing modality information efficiently. Specifically, the acoustic features are extracted by CNN-BiLSTM while the textual features are extracted by BiLSTM. The modality-fused representation is then generated by the cross-attention module. We apply the gate-control mechanism to achieve the balanced integration of the original modality representation and the modality-fused representation. Second, the degree of attention focus can be considered, as the uncertainty and the entropy of the same token should converge to the same value independent of the length. To improve the generalization of the model to various testing-sequence lengths, we adopt the length-scaled dot product to calculate the attention score, which can be interpreted from a theoretical view of entropy. The operation of the length-scaled dot product is cheap but effective. Experiments are conducted on the benchmark dataset CMU-MOSEI. Compared to the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, showing an improvement in the balance between performance and the number of parameters. Moreover, the ablation study signifies the effectiveness of our model and its scalability to various input-sequence lengths, wherein the relative improvement is almost 20% of the baseline without a length-scaled dot product.
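The abstract above describes two mechanisms: a cross-attention module whose scores use a length-scaled dot product, and a gate-control mechanism that balances the original modality representation against the modality-fused one. The following is a minimal PyTorch-style sketch of those two ideas only, not the authors' implementation; the module name, tensor shapes, and the exact log-length scaling factor are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn


class GatedLengthScaledCrossAttention(nn.Module):
    """Sketch of length-scaled cross-attention with a gated residual blend.

    Hypothetical module for illustration; it is not the LGCCT source code.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)  # gate over [original; fused]

    def forward(self, query_seq: torch.Tensor, key_seq: torch.Tensor) -> torch.Tensor:
        # query_seq: (batch, n_q, dim), e.g. text features from a BiLSTM
        # key_seq:   (batch, n_k, dim), e.g. audio features from a CNN-BiLSTM
        q, k, v = self.q_proj(query_seq), self.k_proj(key_seq), self.v_proj(key_seq)
        n_k, dim = k.size(1), k.size(-1)

        # Length-scaled dot product: an extra log(n_k) factor on top of the usual
        # 1/sqrt(dim) scaling, so the attention entropy stays roughly stable as the
        # key sequence length changes (the entropy argument in the abstract).
        scale = math.log(max(n_k, 2)) / math.sqrt(dim)
        attn = torch.softmax(torch.matmul(q, k.transpose(-2, -1)) * scale, dim=-1)
        fused = torch.matmul(attn, v)  # modality-fused representation

        # Gate-control mechanism: a sigmoid gate blends the original representation
        # with the modality-fused one, element-wise.
        g = torch.sigmoid(self.gate(torch.cat([query_seq, fused], dim=-1)))
        return g * query_seq + (1.0 - g) * fused


# Toy usage with made-up dimensions: 12 text steps attending over 40 audio steps.
text = torch.randn(2, 12, 64)
audio = torch.randn(2, 40, 64)
out = GatedLengthScaledCrossAttention(dim=64)(text, audio)
print(out.shape)  # torch.Size([2, 12, 64])
```

In this sketch the gate is a sigmoid over the concatenated original and fused features, giving an element-wise convex combination of the two; the paper's actual gating and length-scaling formulas may differ.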
format Online
Article
Text
id pubmed-9316084
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-93160842022-07-27 LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition Liu, Feng Shen, Si-Yuan Fu, Zi-Wang Wang, Han-Yang Zhou, Ai-Min Qi, Jia-Yin Entropy (Basel) Article Semantic-rich speech emotion recognition has a high degree of popularity in a range of areas. Speech emotion recognition aims to recognize human emotional states from utterances containing both acoustic and linguistic information. Since both textual and audio patterns play essential roles in speech emotion recognition (SER) tasks, various works have proposed novel modality fusing methods to exploit text and audio signals effectively. However, most of the high performance of existing models is dependent on a great number of learnable parameters, and they can only work well on data with fixed length. Therefore, minimizing computational overhead and improving generalization to unseen data with various lengths while maintaining a certain level of recognition accuracy is an urgent application problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model is capable of fusing modality information efficiently. Specifically, the acoustic features are extracted by CNN-BiLSTM while the textual features are extracted by BiLSTM. The modality-fused representation is then generated by the cross-attention module. We apply the gate-control mechanism to achieve the balanced integration of the original modality representation and the modality-fused representation. Second, the degree of attention focus can be considered, as the uncertainty and the entropy of the same token should converge to the same value independent of the length. To improve the generalization of the model to various testing-sequence lengths, we adopt the length-scaled dot product to calculate the attention score, which can be interpreted from a theoretical view of entropy. The operation of the length-scaled dot product is cheap but effective. Experiments are conducted on the benchmark dataset CMU-MOSEI. Compared to the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, showing an improvement in the balance between performance and the number of parameters. Moreover, the ablation study signifies the effectiveness of our model and its scalability to various input-sequence lengths, wherein the relative improvement is almost 20% of the baseline without a length-scaled dot product. MDPI 2022-07-21 /pmc/articles/PMC9316084/ /pubmed/35885233 http://dx.doi.org/10.3390/e24071010 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Liu, Feng
Shen, Si-Yuan
Fu, Zi-Wang
Wang, Han-Yang
Zhou, Ai-Min
Qi, Jia-Yin
LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition
title LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition
title_full LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition
title_fullStr LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition
title_full_unstemmed LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition
title_short LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition
title_sort lgcct: a light gated and crossed complementation transformer for multimodal speech emotion recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9316084/
https://www.ncbi.nlm.nih.gov/pubmed/35885233
http://dx.doi.org/10.3390/e24071010
work_keys_str_mv AT liufeng lgcctalightgatedandcrossedcomplementationtransformerformultimodalspeechemotionrecognition
AT shensiyuan lgcctalightgatedandcrossedcomplementationtransformerformultimodalspeechemotionrecognition
AT fuziwang lgcctalightgatedandcrossedcomplementationtransformerformultimodalspeechemotionrecognition
AT wanghanyang lgcctalightgatedandcrossedcomplementationtransformerformultimodalspeechemotionrecognition
AT zhouaimin lgcctalightgatedandcrossedcomplementationtransformerformultimodalspeechemotionrecognition
AT qijiayin lgcctalightgatedandcrossedcomplementationtransformerformultimodalspeechemotionrecognition