Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms
Main authors: | Tursunov, Anvarjon; Mustaqeem; Choeh, Joon Yeon; Kwon, Soonil |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8434188/ https://www.ncbi.nlm.nih.gov/pubmed/34502785 http://dx.doi.org/10.3390/s21175892 |
author | Tursunov, Anvarjon; Mustaqeem; Choeh, Joon Yeon; Kwon, Soonil |
---|---|
collection | PubMed |
description | Speech signals are used as a primary input source in human–computer interaction (HCI) to develop applications such as automatic speech recognition (ASR), speech emotion recognition (SER), and gender and age recognition. Classifying speakers by age and gender is a challenging task in speech processing owing to the limitations of current methods for extracting salient high-level speech features and of current classification models. To address these problems, we introduce a novel end-to-end convolutional neural network (CNN) for age and gender recognition from speech signals, with a specially designed multi-attention module (MAM). The MAM effectively extracts salient spatial and temporal features from the input data. It uses rectangular filters as kernels in its convolution layers and comprises two separate attention mechanisms, one for time and one for frequency. The time attention branch learns to detect temporal cues, whereas the frequency attention branch extracts the features most relevant to the target by focusing on spatial frequency features. The spatial and temporal features extracted by the two branches complement one another and, combined, provide high performance in age and gender classification. The proposed age and gender classification system was tested on the Common Voice dataset and a locally developed Korean speech recognition dataset. On Common Voice, our model achieved accuracy scores of 96%, 73%, and 76% for gender, age, and age-gender classification, respectively. On the Korean speech recognition dataset, the results were 97%, 97%, and 90% for gender, age, and age-gender recognition, respectively. These experimental results demonstrate the superiority and robustness of the proposed model for age, gender, and age-gender recognition from speech signals. |
format | Online Article Text |
id | pubmed-8434188 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-8434188 2021-09-12; Sensors (Basel); Article; MDPI 2021-09-01; /pmc/articles/PMC8434188/ /pubmed/34502785 http://dx.doi.org/10.3390/s21175892; Text en; © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
title | Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8434188/ https://www.ncbi.nlm.nih.gov/pubmed/34502785 http://dx.doi.org/10.3390/s21175892 |
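
The description field above explains the multi-attention module (MAM) only in words. As an illustration, here is a minimal pure-Python sketch of the general idea under several assumptions that do not come from this record: the time branch uses a 1 × k rectangular kernel and the frequency branch a k × 1 kernel; each branch turns mean-pooled energies into softmax attention weights; and the two branches are fused by element-wise sum. The kernels are fixed rather than learned, and the function names (`conv2d_same`, `time_attention`, `frequency_attention`, `multi_attention`) are hypothetical. It is not the authors' implementation.

```python
import math
from typing import List

# Spectrogram laid out as rows = frequency bins, columns = time frames.
Matrix = List[List[float]]


def conv2d_same(x: Matrix, kernel: Matrix) -> Matrix:
    """Stride-1 2D cross-correlation with zero padding; output keeps the input shape.

    A rectangular kernel (e.g. 1 x k or k x 1) covers a different extent along
    time than along frequency. Odd kernel sizes keep the window centered.
    """
    n_freq, n_time = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out: Matrix = []
    for r in range(n_freq):
        row = []
        for c in range(n_time):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < n_freq and 0 <= cc < n_time:  # zero padding outside
                        acc += kernel[i][j] * x[rr][cc]
            row.append(acc)
        out.append(row)
    return out


def softmax(v: List[float]) -> List[float]:
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def time_attention(x: Matrix) -> Matrix:
    """Hypothetical time branch: score each frame by its mean over frequency,
    normalize the scores with softmax over time, and reweight each column."""
    n_freq, n_time = len(x), len(x[0])
    scores = [sum(x[f][t] for f in range(n_freq)) / n_freq for t in range(n_time)]
    w = softmax(scores)
    return [[x[f][t] * w[t] for t in range(n_time)] for f in range(n_freq)]


def frequency_attention(x: Matrix) -> Matrix:
    """Hypothetical frequency branch: score each bin by its mean over time,
    normalize with softmax over frequency, and reweight each row."""
    n_time = len(x[0])
    w = softmax([sum(row) / n_time for row in x])
    return [[v * w[f] for v in row] for f, row in enumerate(x)]


def multi_attention(x: Matrix, time_kernel: Matrix, freq_kernel: Matrix) -> Matrix:
    """Run both branches on the same input and fuse them by element-wise sum (an assumption)."""
    t_feat = time_attention(conv2d_same(x, time_kernel))
    f_feat = frequency_attention(conv2d_same(x, freq_kernel))
    return [[a + b for a, b in zip(rt, rf)] for rt, rf in zip(t_feat, f_feat)]


if __name__ == "__main__":
    # Toy 3-bin x 4-frame "spectrogram"; frame 2 carries most of the energy.
    spec = [
        [0.1, 0.2, 0.9, 0.1],
        [0.0, 0.1, 0.8, 0.2],
        [0.3, 0.2, 0.7, 0.1],
    ]
    time_kernel = [[1 / 3, 1 / 3, 1 / 3]]      # 1 x 3: spans time only
    freq_kernel = [[1 / 3], [1 / 3], [1 / 3]]  # 3 x 1: spans frequency only
    out = multi_attention(spec, time_kernel, freq_kernel)
    print(len(out), len(out[0]))  # prints: 3 4
```

In a trainable model the kernels and attention scores would be learned layers; the fixed averaging kernels here only show how a time-wise and a frequency-wise weighting can run in parallel on one spectrogram and be merged into a map of the same shape.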