Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition

Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning and human–machine/robot interaction. In this paper, we introduce a novel speech emotion recognition method, based on the Squeeze and Excitation ResNet (SE-ResNet) model and fed with spectrogram inputs. In order to overcome the limitations of state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable, discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves 83.35% and 64.92% global accuracy rates on the publicly available RAVDESS and CREMA-D datasets, respectively. When compared with the results provided by human observers, the gains in global accuracy scores exceed 24%. Finally, an objective comparative evaluation against state-of-the-art techniques demonstrates accuracy gains of more than 3%.
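For illustration only, a minimal PyTorch-style sketch of the kind of GhostVLAD aggregation layer the abstract describes, which pools frame-level CNN features into a single utterance-level vector. The class name, cluster counts, and tensor layout below are assumptions for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostVLADPooling(nn.Module):
    """NetVLAD-style pooling with extra 'ghost' clusters that absorb
    uninformative frames and are discarded from the final descriptor.
    (Hypothetical sketch; cluster counts are illustrative, not from the paper.)"""

    def __init__(self, feature_dim: int, num_clusters: int = 8, num_ghost: int = 2):
        super().__init__()
        self.num_clusters = num_clusters
        total = num_clusters + num_ghost
        # Soft-assignment of every frame-level feature to each cluster.
        self.assign = nn.Linear(feature_dim, total)
        # Learnable centroids, including the ghost clusters.
        self.centroids = nn.Parameter(torch.randn(total, feature_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, feature_dim) frame-level features from the CNN backbone.
        soft_assign = F.softmax(self.assign(x), dim=-1)             # (B, T, K+G)
        residuals = x.unsqueeze(2) - self.centroids                 # (B, T, K+G, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K+G, D)
        vlad = vlad[:, : self.num_clusters]                         # drop ghost clusters
        vlad = F.normalize(vlad, dim=-1)                            # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)                 # (B, K*D) utterance vector

# Example: pool 120 frames of 256-dim features into one utterance embedding.
pooling = GhostVLADPooling(feature_dim=256)
utterance = pooling(torch.randn(4, 120, 256))   # -> shape (4, 8 * 256)
```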

Bibliographic Details
Main Authors: Mocanu, Bogdan, Tapu, Ruxandra, Zaharia, Titus
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8234042/
https://www.ncbi.nlm.nih.gov/pubmed/34203112
http://dx.doi.org/10.3390/s21124233
_version_ 1783713991062716416
author Mocanu, Bogdan
Tapu, Ruxandra
Zaharia, Titus
author_facet Mocanu, Bogdan
Tapu, Ruxandra
Zaharia, Titus
author_sort Mocanu, Bogdan
collection PubMed
description Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning and human–machine/robot interaction. In this paper, we introduce a novel speech emotion recognition method, based on the Squeeze and Excitation ResNet (SE-ResNet) model and fed with spectrogram inputs. In order to overcome the limitations of state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable, discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves 83.35% and 64.92% global accuracy rates on the publicly available RAVDESS and CREMA-D datasets, respectively. When compared with the results provided by human observers, the gains in global accuracy scores exceed 24%. Finally, an objective comparative evaluation against state-of-the-art techniques demonstrates accuracy gains of more than 3%.
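As a rough illustration of the metric-learning side of the method described above, here is a plain triplet margin loss on L2-normalized utterance embeddings in PyTorch. The paper's emotionally constrained variant additionally incorporates relations between emotion classes, which is not reproduced here; the function name, margin value, and embedding sizes are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """Pull same-emotion utterances together and push different-emotion ones apart."""
    # Squared Euclidean distances between utterance-level embeddings.
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    # Hinge: the positive must be closer than the negative by at least `margin`.
    return F.relu(d_pos - d_neg + margin).mean()

# Example with random 2048-dim embeddings (e.g. 8 clusters x 256 features).
a, p, n = (torch.randn(16, 2048) for _ in range(3))
loss = triplet_margin_loss(F.normalize(a, dim=-1),
                           F.normalize(p, dim=-1),
                           F.normalize(n, dim=-1))
```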
format Online
Article
Text
id pubmed-8234042
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8234042 2021-06-27 Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition Mocanu, Bogdan Tapu, Ruxandra Zaharia, Titus Sensors (Basel) Article Emotion is a form of high-level paralinguistic information that is intrinsically conveyed by human speech. Automatic speech emotion recognition is an essential challenge for various applications, including mental disease diagnosis, audio surveillance, human behavior understanding, e-learning and human–machine/robot interaction. In this paper, we introduce a novel speech emotion recognition method, based on the Squeeze and Excitation ResNet (SE-ResNet) model and fed with spectrogram inputs. In order to overcome the limitations of state-of-the-art techniques, which fail to provide a robust feature representation at the utterance level, the CNN architecture is extended with a trainable, discriminative GhostVLAD clustering layer that aggregates the audio features into a compact, single-utterance vector representation. In addition, an end-to-end neural embedding approach is introduced, based on an emotionally constrained triplet loss function. The loss function integrates the relations between the various emotional patterns and thus improves the latent space data representation. The proposed methodology achieves 83.35% and 64.92% global accuracy rates on the publicly available RAVDESS and CREMA-D datasets, respectively. When compared with the results provided by human observers, the gains in global accuracy scores exceed 24%. Finally, an objective comparative evaluation against state-of-the-art techniques demonstrates accuracy gains of more than 3%. MDPI 2021-06-20 /pmc/articles/PMC8234042/ /pubmed/34203112 http://dx.doi.org/10.3390/s21124233 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Mocanu, Bogdan
Tapu, Ruxandra
Zaharia, Titus
Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
title Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
title_full Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
title_fullStr Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
title_full_unstemmed Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
title_short Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition
title_sort utterance level feature aggregation with deep metric learning for speech emotion recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8234042/
https://www.ncbi.nlm.nih.gov/pubmed/34203112
http://dx.doi.org/10.3390/s21124233
work_keys_str_mv AT mocanubogdan utterancelevelfeatureaggregationwithdeepmetriclearningforspeechemotionrecognition
AT tapuruxandra utterancelevelfeatureaggregationwithdeepmetriclearningforspeechemotionrecognition
AT zahariatitus utterancelevelfeatureaggregationwithdeepmetriclearningforspeechemotionrecognition