ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster

Speech emotion recognition (SER) is a task that learns a matching function between speech features and emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text, which makes learning speech features fully and effectively challenging when using feature extractors designed for images or text. In this paper, we propose ACG-EmoCluster, a novel semi-supervised framework for extracting spatial and temporal features from speech. The framework is equipped with a feature extractor that captures spatial and temporal features simultaneously, and with a clustering classifier that enhances the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn-Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn-Convolution network has a global spatial receptive field and can be generalized to the convolution block of any neural network according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. Experimental results on MSP-Podcast demonstrate that ACG-EmoCluster captures effective speech representations and outperforms all baselines in both supervised and semi-supervised SER tasks.
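The record does not include the authors' code; as a rough illustration of the two-stage feature extractor the abstract describes (a convolution block augmented with global self-attention, followed by a bidirectional recurrent pass), here is a minimal NumPy sketch. The shapes, the single-head attention form, and the tanh recurrence standing in for a true BiGRU are all simplifying assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def attn_convolution(x, kernel, wq, wk):
    """Depthwise 1-D convolution over time, then single-head self-attention
    so every frame can attend to every other frame (a global receptive field)."""
    t, d = x.shape
    k = kernel.shape[0]
    pad = np.pad(x, ((k // 2, k // 2), (0, 0)))
    # Convolve: each output frame is a kernel-weighted sum of its local window.
    conv = np.stack([np.einsum('kd,kd->d', pad[i:i + k], kernel)
                     for i in range(t)])
    # Scaled dot-product attention mixes information across all frames.
    q, key = conv @ wq, conv @ wk
    scores = q @ key.T / np.sqrt(q.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ conv  # (t, d)

def bigru_like(x, wf, wb):
    """Toy bidirectional recurrence (a stand-in for the BiGRU): a forward
    and a backward tanh pass, concatenated per frame."""
    t, d = x.shape
    hf, hb = np.zeros(d), np.zeros(d)
    fwd, bwd = np.zeros((t, d)), np.zeros((t, d))
    for i in range(t):
        hf = np.tanh(x[i] + hf @ wf)
        fwd[i] = hf
    for i in reversed(range(t)):
        hb = np.tanh(x[i] + hb @ wb)
        bwd[i] = hb
    return np.concatenate([fwd, bwd], axis=1)  # (t, 2d)

frames, dim = 8, 4  # e.g. 8 spectrogram frames with 4 features each
x = rng.standard_normal((frames, dim))
spatial = attn_convolution(x,
                           rng.standard_normal((3, dim)),
                           rng.standard_normal((dim, dim)),
                           rng.standard_normal((dim, dim)))
features = bigru_like(spatial,
                      0.1 * rng.standard_normal((dim, dim)),
                      0.1 * rng.standard_normal((dim, dim)))
print(features.shape)  # (8, 8)
```

The attention step is what gives the convolution block its global receptive field: even with a kernel of width 3, each output frame is a weighted mixture of all frames, and the bidirectional pass then adds temporal context from both directions.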


Bibliographic Details
Main Authors: Zhao, Huan, Li, Lixuan, Zha, Xupeng, Wang, Yujiang, Xie, Zhaoxin, Zhang, Zixing
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10223526/
https://www.ncbi.nlm.nih.gov/pubmed/37430691
http://dx.doi.org/10.3390/s23104777
author Zhao, Huan
Li, Lixuan
Zha, Xupeng
Wang, Yujiang
Xie, Zhaoxin
Zhang, Zixing
collection PubMed
description Speech emotion recognition (SER) is a task that tailors a matching function between the speech features and the emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text. This makes entirely and effectively learning speech features challenging when using feature extractors designed for images or texts. In this paper, we propose a novel semi-supervised framework for extracting spatial and temporal features from speech, called the ACG-EmoCluster. This framework is equipped with a feature extractor for simultaneously extracting the spatial and temporal features, as well as a clustering classifier for enhancing the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network enjoys a global spatial receptive field and can be generalized to the convolution block of any neural networks according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. The experimental results on the MSP-Podcast demonstrate that our ACG-EmoCluster can capture effective speech representation and outperform all baselines in both supervised and semi-supervised SER tasks.
format Online
Article
Text
id pubmed-10223526
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10223526 2023-05-28 Sensors (Basel) Article MDPI 2023-05-16 /pmc/articles/PMC10223526/ /pubmed/37430691 http://dx.doi.org/10.3390/s23104777 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10223526/
https://www.ncbi.nlm.nih.gov/pubmed/37430691
http://dx.doi.org/10.3390/s23104777
work_keys_str_mv AT zhaohuan acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster
AT lilixuan acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster
AT zhaxupeng acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster
AT wangyujiang acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster
AT xiezhaoxin acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster
AT zhangzixing acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster