ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster
Speech emotion recognition (SER) is a task that tailors a matching function between the speech features and the emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text. This makes entirely and effectively learning speech features challenging when using feature extractors designed for images or texts. In this paper, we propose a novel semi-supervised framework for extracting spatial and temporal features from speech, called the ACG-EmoCluster. This framework is equipped with a feature extractor for simultaneously extracting the spatial and temporal features, as well as a clustering classifier for enhancing the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network enjoys a global spatial receptive field and can be generalized to the convolution block of any neural networks according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. The experimental results on the MSP-Podcast demonstrate that our ACG-EmoCluster can capture effective speech representation and outperform all baselines in both supervised and semi-supervised SER tasks.
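The DeepCluster-style enhancement the abstract describes, clustering speech representations to produce pseudo-labels for unsupervised training, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `kmeans_pseudo_labels`, the deterministic farthest-point initialization, and the toy embeddings are all inventions of this sketch.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=20):
    """DeepCluster-style step: cluster utterance embeddings with k-means
    and return the cluster indices as pseudo-labels for a classifier."""
    # Deterministic farthest-point initialization (an assumption of this
    # sketch; DeepCluster itself periodically re-clusters during training).
    centroids = [features[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(features[dists.argmax()])
    centroids = np.array(centroids)

    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels

# Toy "utterance embeddings": two well-separated groups standing in for
# the feature-extractor output on unlabeled speech.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0.0, 0.1, (8, 16)),
                   rng.normal(5.0, 0.1, (8, 16))])
pseudo = kmeans_pseudo_labels(feats, k=2)  # one pseudo-label per utterance
```

In the full framework, such pseudo-labels would supervise a classification head over the Attn–Convolution + BiGRU features, alternating clustering and training.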
Main Authors: | Zhao, Huan; Li, Lixuan; Zha, Xupeng; Wang, Yujiang; Xie, Zhaoxin; Zhang, Zixing |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10223526/ https://www.ncbi.nlm.nih.gov/pubmed/37430691 http://dx.doi.org/10.3390/s23104777 |
_version_ | 1785049963195006976 |
---|---|
author | Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing |
author_facet | Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing |
author_sort | Zhao, Huan |
collection | PubMed |
description | Speech emotion recognition (SER) is a task that tailors a matching function between the speech features and the emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text. This makes entirely and effectively learning speech features challenging when using feature extractors designed for images or texts. In this paper, we propose a novel semi-supervised framework for extracting spatial and temporal features from speech, called the ACG-EmoCluster. This framework is equipped with a feature extractor for simultaneously extracting the spatial and temporal features, as well as a clustering classifier for enhancing the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network enjoys a global spatial receptive field and can be generalized to the convolution block of any neural networks according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. The experimental results on the MSP-Podcast demonstrate that our ACG-EmoCluster can capture effective speech representation and outperform all baselines in both supervised and semi-supervised SER tasks. |
format | Online Article Text |
id | pubmed-10223526 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10223526 2023-05-28 ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing Sensors (Basel) Article Speech emotion recognition (SER) is a task that tailors a matching function between the speech features and the emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text. This makes entirely and effectively learning speech features challenging when using feature extractors designed for images or texts. In this paper, we propose a novel semi-supervised framework for extracting spatial and temporal features from speech, called the ACG-EmoCluster. This framework is equipped with a feature extractor for simultaneously extracting the spatial and temporal features, as well as a clustering classifier for enhancing the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network enjoys a global spatial receptive field and can be generalized to the convolution block of any neural networks according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. The experimental results on the MSP-Podcast demonstrate that our ACG-EmoCluster can capture effective speech representation and outperform all baselines in both supervised and semi-supervised SER tasks. MDPI 2023-05-16 /pmc/articles/PMC10223526/ /pubmed/37430691 http://dx.doi.org/10.3390/s23104777 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_full | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_fullStr | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_full_unstemmed | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_short | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_sort | acg-emocluster: a novel framework to capture spatial and temporal information from emotional speech enhanced by deepcluster |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10223526/ https://www.ncbi.nlm.nih.gov/pubmed/37430691 http://dx.doi.org/10.3390/s23104777 |
work_keys_str_mv | AT zhaohuan acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT lilixuan acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT zhaxupeng acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT wangyujiang acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT xiezhaoxin acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT zhangzixing acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster |