Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine. In addition, m...

Bibliographic Details
Main authors:	Anvarjon, Tursunov; Mustaqeem; Kwon, Soonil
Format:	Online Article Text
Language:	English
Published:	MDPI 2020
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7570673/ https://www.ncbi.nlm.nih.gov/pubmed/32932723 http://dx.doi.org/10.3390/s20185212

_version_	1783597001337733120
author	Anvarjon, Tursunov; Mustaqeem; Kwon, Soonil
author_facet	Anvarjon, Tursunov; Mustaqeem; Kwon, Soonil
author_sort	Anvarjon, Tursunov
collection	PubMed
description	Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, speech emotion recognition (SER) systems evaluate the emotional state of a speaker by analyzing his or her speech signal. Emotion recognition is a challenging task for a machine, and making the machine smart enough to recognize emotions efficiently is equally challenging. The speech signal is hard to examine with signal processing methods because it consists of different frequencies and features that vary with emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Although various algorithms are being developed for SER, success rates remain very low and depend on the language, the emotions, and the database. In this paper, we propose a new lightweight, effective SER model that has low computational complexity and high recognition accuracy. The suggested method uses a convolutional neural network (CNN) with a plain rectangular filter and a modified pooling strategy to learn deep frequency features that have more discriminative power for SER. The proposed CNN model was trained on frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmark speech datasets, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset and the Berlin Emotional Speech Database (EMO-DB), and obtained recognition rates of 77.01% and 92.02%, respectively. The experimental results demonstrated that the proposed CNN-based SER system can achieve better recognition performance than state-of-the-art SER systems.
format	Online Article Text
id	pubmed-7570673
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-7570673 2020-10-28 Sensors (Basel) Article MDPI 2020-09-12 /pmc/articles/PMC7570673/ /pubmed/32932723 http://dx.doi.org/10.3390/s20185212 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title	Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features
topic	Article
