
Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms

Speech signals are being used as a primary input source in human–computer interaction (HCI) to develop several applications, such as automatic speech recognition (ASR), speech emotion recognition (SER), gender, and age recognition. Classifying speakers according to their age and gender is a challeng...


Bibliographic Details
Main authors: Tursunov, Anvarjon; Mustaqeem; Choeh, Joon Yeon; Kwon, Soonil
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8434188/
https://www.ncbi.nlm.nih.gov/pubmed/34502785
http://dx.doi.org/10.3390/s21175892
_version_ 1783751539044646912
author Tursunov, Anvarjon
Mustaqeem,
Choeh, Joon Yeon
Kwon, Soonil
author_facet Tursunov, Anvarjon
Mustaqeem,
Choeh, Joon Yeon
Kwon, Soonil
author_sort Tursunov, Anvarjon
collection PubMed
description Speech signals are used as a primary input source in human–computer interaction (HCI) to develop applications such as automatic speech recognition (ASR), speech emotion recognition (SER), and gender and age recognition. Classifying speakers according to their age and gender is a challenging task in speech processing, owing to the inability of current methods to extract salient high-level speech features and to the limitations of current classification models. To address these problems, we introduce a novel end-to-end convolutional neural network (CNN) with a specially designed multi-attention module (MAM) for age and gender recognition from speech signals. Our proposed model uses the MAM to extract salient spatial and temporal features from the input data effectively. The MAM uses a rectangular filter as the kernel in its convolution layers and comprises two separate attention mechanisms, one for time and one for frequency. The time attention branch learns to detect temporal cues, whereas the frequency attention branch extracts the features most relevant to the target by focusing on spatial frequency features. The extracted spatial and temporal features complement one another and provide high performance in age and gender classification. The proposed age and gender classification system was tested using the Common Voice dataset and a locally developed Korean speech recognition dataset. Our model achieved accuracy scores of 96%, 73%, and 76% for gender, age, and age-gender classification, respectively, on the Common Voice dataset. The results on the Korean speech recognition dataset were 97%, 97%, and 90% for gender, age, and age-gender recognition, respectively. The prediction performance obtained in the experiments demonstrates the superiority and robustness of the proposed model for age, gender, and age-gender recognition from speech signals.
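The description above mentions rectangular convolution kernels feeding two separate attention branches, one over time and one over frequency. The following is a minimal sketch of that general idea in plain Python, not the authors' implementation: the record does not give the layer sizes or the attention formula, so the pooling-plus-softmax weighting, the kernel shapes, and every function name here (softmax, conv2d_valid, time_attention, frequency_attention, multi_attention) are assumptions chosen only for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def conv2d_valid(spec, kernel):
    """'Valid' 2-D cross-correlation of spec (freq x time) with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(spec) - kh + 1
    out_w = len(spec[0]) - kw + 1
    return [
        [
            sum(spec[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def time_attention(feat):
    """Assumed form: weight each time frame by a softmax over its
    mean activation across frequency."""
    n_freq, n_time = len(feat), len(feat[0])
    scores = [sum(feat[f][t] for f in range(n_freq)) / n_freq
              for t in range(n_time)]
    weights = softmax(scores)
    return [[feat[f][t] * weights[t] for t in range(n_time)]
            for f in range(n_freq)]


def frequency_attention(feat):
    """Assumed form: weight each frequency bin by a softmax over its
    mean activation across time."""
    scores = [sum(row) / len(row) for row in feat]
    weights = softmax(scores)
    return [[v * weights[f] for v in row] for f, row in enumerate(feat)]


def flatten(matrix):
    """Flatten a list of rows into a single list."""
    return [v for row in matrix for v in row]


def multi_attention(spec, time_kernel, freq_kernel):
    """Run both branches and concatenate their flattened outputs."""
    t_feat = time_attention(conv2d_valid(spec, time_kernel))
    f_feat = frequency_attention(conv2d_valid(spec, freq_kernel))
    return flatten(t_feat) + flatten(f_feat)


# Toy spectrogram: 3 frequency bins x 4 time frames (made-up values).
spec = [[0.1, 0.5, 0.2, 0.9],
        [0.3, 0.8, 0.4, 0.1],
        [0.7, 0.2, 0.6, 0.3]]
time_kernel = [[1.0, 1.0, 1.0]]        # 1 x 3: wide along the time axis
freq_kernel = [[1.0], [1.0], [1.0]]    # 3 x 1: tall along the frequency axis
features = multi_attention(spec, time_kernel, freq_kernel)
```

In this toy setup the 1 x 3 kernel spans three neighbouring time frames within one frequency bin, while the 3 x 1 kernel spans three frequency bins within one frame; concatenating the two branch outputs is one simple way to combine them, and a trained model would learn the kernel values rather than use fixed ones.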
format Online
Article
Text
id pubmed-8434188
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8434188 2021-09-12 Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms Tursunov, Anvarjon; Mustaqeem; Choeh, Joon Yeon; Kwon, Soonil Sensors (Basel) Article MDPI 2021-09-01 /pmc/articles/PMC8434188/ /pubmed/34502785 http://dx.doi.org/10.3390/s21175892 Text en
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Tursunov, Anvarjon
Mustaqeem,
Choeh, Joon Yeon
Kwon, Soonil
Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms
title Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8434188/
https://www.ncbi.nlm.nih.gov/pubmed/34502785
http://dx.doi.org/10.3390/s21175892