
Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation


Bibliographic Details
Main Authors: Tao, Huawei, Shan, Shuai, Hu, Ziyi, Zhu, Chunhua, Ge, Hongyi
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857941/
https://www.ncbi.nlm.nih.gov/pubmed/36673208
http://dx.doi.org/10.3390/e25010068
author Tao, Huawei
Shan, Shuai
Hu, Ziyi
Zhu, Chunhua
Ge, Hongyi
collection PubMed
description The absence of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address sample sparsity, but data augmentation algorithms for SER remain under-explored. In this paper, the effectiveness of classical acoustic data augmentation methods in SER is analyzed, and on this basis a strong generalized SER model based on effective data augmentation is proposed. The model uses a multi-channel feature extractor consisting of multiple sub-networks to extract emotional representations: different kinds of augmented data that effectively improve SER performance are fed into the sub-networks, and the emotional representations are obtained by a weighted fusion of the sub-networks' output feature maps. To make the model robust to unseen speakers, adversarial training is employed to generalize the emotion representations: a discriminator estimates the Wasserstein distance between the feature distributions of different speakers, forcing the feature extractor to learn speaker-invariant emotional representations. Simulation results on the IEMOCAP corpus show that the proposed method is 2–9% ahead of related SER algorithms, demonstrating its effectiveness.
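The two core mechanisms named in the abstract, weighted fusion of sub-network feature maps and a Wasserstein distance between speakers' feature distributions, can be sketched in a few lines. This is an illustrative simplification, not the authors' code: the function names, the softmax weighting, and the toy feature vectors are assumptions, and the closed-form sorted-sample estimate below applies only to 1-D empirical distributions (the paper instead approximates the distance adversarially with a trained discriminator).

```python
import numpy as np

def weighted_fusion(feature_maps, weights):
    """Fuse the output feature maps of several sub-networks using
    softmax-normalized fusion weights (hypothetical weighting scheme)."""
    w = np.exp(weights - np.max(weights))  # numerically stable softmax
    w = w / w.sum()
    return sum(wi * fm for wi, fm in zip(w, feature_maps))

def wasserstein_1d(a, b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples:
    the mean absolute difference of the sorted samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

# Toy example: three sub-networks each emit a 4-dim feature vector.
maps = [np.array([1.0, 0.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0, 0.0]),
        np.array([0.0, 0.0, 1.0, 0.0])]
fused = weighted_fusion(maps, np.zeros(3))  # equal weights -> plain average

# Distance between two (tiny) per-speaker feature samples.
dist = wasserstein_1d(np.array([0.0, 1.0]), np.array([1.0, 2.0]))
```

In the adversarial setup described above, the discriminator's estimate of this distance would be maximized by the discriminator and minimized by the feature extractor, pushing the per-speaker feature distributions together.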
format Online
Article
Text
id pubmed-9857941
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9857941 2023-01-21 Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation Tao, Huawei Shan, Shuai Hu, Ziyi Zhu, Chunhua Ge, Hongyi Entropy (Basel) Article MDPI 2022-12-30 /pmc/articles/PMC9857941/ /pubmed/36673208 http://dx.doi.org/10.3390/e25010068 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857941/
https://www.ncbi.nlm.nih.gov/pubmed/36673208
http://dx.doi.org/10.3390/e25010068