ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster
Speech emotion recognition (SER) is a task that tailors a matching function between the speech features and the emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text. This makes entirely and effectively learning speech features challenging when using feature extractors designed for images or texts. In this paper, we propose a novel semi-supervised framework for extracting spatial and temporal features from speech, called the ACG-EmoCluster. This framework is equipped with a feature extractor for simultaneously extracting the spatial and temporal features, as well as a clustering classifier for enhancing the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network enjoys a global spatial receptive field and can be generalized to the convolution block of any neural networks according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. The experimental results on the MSP-Podcast demonstrate that our ACG-EmoCluster can capture effective speech representation and outperform all baselines in both supervised and semi-supervised SER tasks.
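The DeepCluster-style enhancement the abstract describes, clustering speech representations to produce pseudo-labels for unsupervised training, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `kmeans_pseudo_labels`, the deterministic farthest-point initialization, and the toy embeddings are all inventions of this sketch.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=20):
    """DeepCluster-style step: cluster utterance embeddings with k-means
    and return the cluster indices as pseudo-labels for a classifier."""
    # Deterministic farthest-point initialization (an assumption of this
    # sketch; DeepCluster itself periodically re-clusters during training).
    centroids = [features[0]]
    for _ in range(1, k):
        dists = np.min(
            [np.linalg.norm(features - c, axis=1) for c in centroids], axis=0
        )
        centroids.append(features[dists.argmax()])
    centroids = np.array(centroids)

    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels

# Toy "utterance embeddings": two well-separated groups standing in for
# the feature-extractor output on unlabeled speech.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0.0, 0.1, (8, 16)),
                   rng.normal(5.0, 0.1, (8, 16))])
pseudo = kmeans_pseudo_labels(feats, k=2)  # one pseudo-label per utterance
```

In the full framework, such pseudo-labels would supervise a classification head over the Attn–Convolution + BiGRU features, alternating clustering and training.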
Main Authors: | Zhao, Huan; Li, Lixuan; Zha, Xupeng; Wang, Yujiang; Xie, Zhaoxin; Zhang, Zixing |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10223526/ https://www.ncbi.nlm.nih.gov/pubmed/37430691 http://dx.doi.org/10.3390/s23104777 |
_version_ | 1785049963195006976 |
---|---|
author | Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing |
author_facet | Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing |
author_sort | Zhao, Huan |
collection | PubMed |
description | Speech emotion recognition (SER) is a task that tailors a matching function between the speech features and the emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text. This makes entirely and effectively learning speech features challenging when using feature extractors designed for images or texts. In this paper, we propose a novel semi-supervised framework for extracting spatial and temporal features from speech, called the ACG-EmoCluster. This framework is equipped with a feature extractor for simultaneously extracting the spatial and temporal features, as well as a clustering classifier for enhancing the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network enjoys a global spatial receptive field and can be generalized to the convolution block of any neural networks according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. The experimental results on the MSP-Podcast demonstrate that our ACG-EmoCluster can capture effective speech representation and outperform all baselines in both supervised and semi-supervised SER tasks. |
format | Online Article Text |
id | pubmed-10223526 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10223526 2023-05-28 ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing Sensors (Basel) Article Speech emotion recognition (SER) is a task that tailors a matching function between the speech features and the emotion labels. Speech data have higher information saturation than images and stronger temporal coherence than text. This makes entirely and effectively learning speech features challenging when using feature extractors designed for images or texts. In this paper, we propose a novel semi-supervised framework for extracting spatial and temporal features from speech, called the ACG-EmoCluster. This framework is equipped with a feature extractor for simultaneously extracting the spatial and temporal features, as well as a clustering classifier for enhancing the speech representations through unsupervised learning. Specifically, the feature extractor combines an Attn–Convolution neural network and a Bidirectional Gated Recurrent Unit (BiGRU). The Attn–Convolution network enjoys a global spatial receptive field and can be generalized to the convolution block of any neural networks according to the data scale. The BiGRU is conducive to learning temporal information on a small-scale dataset, thereby alleviating data dependence. The experimental results on the MSP-Podcast demonstrate that our ACG-EmoCluster can capture effective speech representation and outperform all baselines in both supervised and semi-supervised SER tasks. MDPI 2023-05-16 /pmc/articles/PMC10223526/ /pubmed/37430691 http://dx.doi.org/10.3390/s23104777 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Zhao, Huan Li, Lixuan Zha, Xupeng Wang, Yujiang Xie, Zhaoxin Zhang, Zixing ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_full | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_fullStr | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_full_unstemmed | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_short | ACG-EmoCluster: A Novel Framework to Capture Spatial and Temporal Information from Emotional Speech Enhanced by DeepCluster |
title_sort | acg-emocluster: a novel framework to capture spatial and temporal information from emotional speech enhanced by deepcluster |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10223526/ https://www.ncbi.nlm.nih.gov/pubmed/37430691 http://dx.doi.org/10.3390/s23104777 |
work_keys_str_mv | AT zhaohuan acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT lilixuan acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT zhaxupeng acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT wangyujiang acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT xiezhaoxin acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster AT zhangzixing acgemoclusteranovelframeworktocapturespatialandtemporalinformationfromemotionalspeechenhancedbydeepcluster |