ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition
In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling the long-term contexts and efficiently aggregating the information are two challenges in speaker recognition, and they have a critical impact on system performance. Previous res...
Main Authors: | Deng, Fei; Deng, Lihong; Jiang, Peifan; Zhang, Gexiang; Yang, Qiang |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9920758/ https://www.ncbi.nlm.nih.gov/pubmed/36772243 http://dx.doi.org/10.3390/s23031203 |
_version_ | 1784887148264030208 |
---|---|
author | Deng, Fei; Deng, Lihong; Jiang, Peifan; Zhang, Gexiang; Yang, Qiang |
author_facet | Deng, Fei; Deng, Lihong; Jiang, Peifan; Zhang, Gexiang; Yang, Qiang |
author_sort | Deng, Fei |
collection | PubMed |
description | In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling long-term contexts and efficiently aggregating information are two challenges in speaker recognition, and both have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, it is difficult to significantly improve performance with these approaches because they also have trouble fully utilizing global information, channel information, and time-frequency information. To address these issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting more informative frame-level features. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. In extensive comparison experiments against current state-of-the-art speaker recognition systems on two popular public speaker recognition datasets, VoxCeleb and CN-Celeb, ResSKNet-SSDP achieved the lowest EER/DCF of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, yet achieves 35.1% better performance. The results show that ResSKNet-SSDP significantly outperforms the current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency for applications in realistic situations. The ablation experiments further show that our proposed approaches also provide significant improvements over previous methods. |
format | Online Article Text |
id | pubmed-9920758 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-9920758 2023-02-12 ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition Deng, Fei; Deng, Lihong; Jiang, Peifan; Zhang, Gexiang; Yang, Qiang Sensors (Basel) Article In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling long-term contexts and efficiently aggregating information are two challenges in speaker recognition, and both have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, it is difficult to significantly improve performance with these approaches because they also have trouble fully utilizing global information, channel information, and time-frequency information. To address these issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting more informative frame-level features. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. In extensive comparison experiments against current state-of-the-art speaker recognition systems on two popular public speaker recognition datasets, VoxCeleb and CN-Celeb, ResSKNet-SSDP achieved the lowest EER/DCF of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, yet achieves 35.1% better performance. The results show that ResSKNet-SSDP significantly outperforms the current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency for applications in realistic situations. The ablation experiments further show that our proposed approaches also provide significant improvements over previous methods. MDPI 2023-01-20 /pmc/articles/PMC9920758/ /pubmed/36772243 http://dx.doi.org/10.3390/s23031203 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Deng, Fei Deng, Lihong Jiang, Peifan Zhang, Gexiang Yang, Qiang ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition |
title | ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition |
title_full | ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition |
title_fullStr | ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition |
title_full_unstemmed | ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition |
title_short | ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition |
title_sort | ressknet-ssdp: effective and light end-to-end architecture for speaker recognition |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9920758/ https://www.ncbi.nlm.nih.gov/pubmed/36772243 http://dx.doi.org/10.3390/s23031203 |
work_keys_str_mv | AT dengfei ressknetssdpeffectiveandlightendtoendarchitectureforspeakerrecognition AT denglihong ressknetssdpeffectiveandlightendtoendarchitectureforspeakerrecognition AT jiangpeifan ressknetssdpeffectiveandlightendtoendarchitectureforspeakerrecognition AT zhanggexiang ressknetssdpeffectiveandlightendtoendarchitectureforspeakerrecognition AT yangqiang ressknetssdpeffectiveandlightendtoendarchitectureforspeakerrecognition |
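
The description field in this record says that SSDP aggregates variable-length frame-level features into a fixed-length utterance-level vector. The sketch below is only a rough illustration of that general idea, an attention-weighted standard deviation over frames, and is not the authors' implementation. The function names, the dot-product scoring with `attn_vector`, and the toy inputs are all assumptions made for this example; the record does not specify the scoring network or any of the paper's layer details.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_std_pooling(
    frames: Sequence[Sequence[float]],
    attn_vector: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Pool T frame vectors (each of size D) into one D-dimensional vector.

    Assumption: a frame's attention score is its dot product with
    `attn_vector`. The scoring used in the paper may differ.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])

    # One scalar score per frame, normalized into weights that sum to 1.
    scores = [sum(x * w for x, w in zip(frame, attn_vector)) for frame in frames]
    weights = softmax(scores)

    pooled = []
    for d in range(dim):
        # Attention-weighted mean and variance of feature d across frames.
        mean = sum(w * frame[d] for w, frame in zip(weights, frames))
        var = sum(w * (frame[d] - mean) ** 2 for w, frame in zip(weights, frames))
        # eps keeps the square root well-defined when the variance is zero.
        pooled.append(math.sqrt(max(var, 0.0) + eps))
    return pooled


# Two toy "utterances" with different numbers of frames (hypothetical data).
short_utt = [[0.0, 1.0], [2.0, 1.0]]
long_utt = [[0.0, 1.0], [2.0, 1.0], [0.0, 1.0], [2.0, 1.0], [0.0, 1.0]]
attn = [0.0, 0.0]  # a zero vector yields uniform attention weights

print(len(attentive_std_pooling(short_utt, attn)))
print(len(attentive_std_pooling(long_utt, attn)))
```

Both calls return a vector whose length equals the feature dimension, regardless of how many frames went in, which is the fixed-length property the description refers to.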