Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation
The absence of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address sample sparsity. However, there is little research on data augmentation algorithms in the field of SER. In this paper, the effectiveness of classical acoustic data augmentation methods in SER is analyzed, and on that basis a strongly generalized speech emotion recognition model based on effective data augmentation is proposed. The model uses a multi-channel feature extractor consisting of multiple sub-networks to extract emotional representations. The kinds of augmented data that effectively improve SER performance are fed into the sub-networks, and the emotional representations are obtained by weighted fusion of the output feature maps of the sub-networks. To make the model robust to unseen speakers, adversarial training is employed to generalize the emotion representations: a discriminator estimates the Wasserstein distance between the feature distributions of different speakers, forcing the feature extractor to learn speaker-invariant emotional representations. Simulation experiments on the IEMOCAP corpus show that the proposed method outperforms related SER algorithms by 2–9%, demonstrating its effectiveness.
Main authors: Tao, Huawei; Shan, Shuai; Hu, Ziyi; Zhu, Chunhua; Ge, Hongyi
Format: Online Article Text
Language: English
Published: MDPI, 2022
Subjects: Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857941/ https://www.ncbi.nlm.nih.gov/pubmed/36673208 http://dx.doi.org/10.3390/e25010068
author | Tao, Huawei; Shan, Shuai; Hu, Ziyi; Zhu, Chunhua; Ge, Hongyi
collection | PubMed |
description | The absence of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address sample sparsity. However, there is little research on data augmentation algorithms in the field of SER. In this paper, the effectiveness of classical acoustic data augmentation methods in SER is analyzed, and on that basis a strongly generalized speech emotion recognition model based on effective data augmentation is proposed. The model uses a multi-channel feature extractor consisting of multiple sub-networks to extract emotional representations. The kinds of augmented data that effectively improve SER performance are fed into the sub-networks, and the emotional representations are obtained by weighted fusion of the output feature maps of the sub-networks. To make the model robust to unseen speakers, adversarial training is employed to generalize the emotion representations: a discriminator estimates the Wasserstein distance between the feature distributions of different speakers, forcing the feature extractor to learn speaker-invariant emotional representations. Simulation experiments on the IEMOCAP corpus show that the proposed method outperforms related SER algorithms by 2–9%, demonstrating its effectiveness.
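The weighted-fusion step described in the abstract — combining the output feature maps of several augmentation-specific sub-networks into one emotional representation — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fuse_features`, the softmax normalization of the fusion weights, and the array shapes are all assumptions.

```python
import numpy as np

def fuse_features(feature_maps, scores):
    """Weighted fusion of per-channel feature maps.

    feature_maps: list of equally shaped arrays, one per sub-network
                  channel (e.g. clean, noise-augmented, pitch-shifted).
    scores: one raw fusion score per channel; converted to weights
            with a numerically stable softmax so they are positive
            and sum to 1.
    Returns the fused feature map and the normalized weights.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())  # stable softmax numerator
    weights /= weights.sum()                 # normalize to sum to 1
    fused = sum(w * f for w, f in zip(weights, feature_maps))
    return fused, weights

# Three hypothetical channels with constant feature maps for clarity.
maps = [np.full((4, 8), v) for v in (1.0, 2.0, 3.0)]
fused, w = fuse_features(maps, [0.0, 0.0, 0.0])  # equal scores -> plain mean
```

With equal scores the fusion reduces to an unweighted mean; in the model described by the abstract, the weights would instead be learned so that more informative augmentation channels contribute more to the emotional representation.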
format | Online Article Text |
id | pubmed-9857941 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9857941 2023-01-21. Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation. Tao, Huawei; Shan, Shuai; Hu, Ziyi; Zhu, Chunhua; Ge, Hongyi. Entropy (Basel), Article. MDPI 2022-12-30. /pmc/articles/PMC9857941/ /pubmed/36673208 http://dx.doi.org/10.3390/e25010068 Text en. © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title | Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9857941/ https://www.ncbi.nlm.nih.gov/pubmed/36673208 http://dx.doi.org/10.3390/e25010068 |