ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition

Bibliographic Details
Main authors: Deng, Fei, Deng, Lihong, Jiang, Peifan, Zhang, Gexiang, Yang, Qiang
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9920758/
https://www.ncbi.nlm.nih.gov/pubmed/36772243
http://dx.doi.org/10.3390/s23031203
_version_ 1784887148264030208
author Deng, Fei
Deng, Lihong
Jiang, Peifan
Zhang, Gexiang
Yang, Qiang
collection PubMed
description In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling long-term contexts and efficiently aggregating information are two challenges in speaker recognition, and both have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, it is difficult to significantly improve performance with these approaches because they also have trouble fully utilizing global information, channel information, and time-frequency information. To address these issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting more informative frame-level features. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. In extensive comparison experiments against current state-of-the-art speaker recognition systems on two popular public speaker recognition datasets, VoxCeleb and CN-Celeb, ResSKNet-SSDP achieved the lowest EER/DCF of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, yet performs 35.1% better. The results show that ResSKNet-SSDP significantly outperforms the current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency for applications in realistic situations. The ablation experiments further show that our proposed approaches also provide significant improvements over previous methods.
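Illustrative note: the description says that SSDP aggregates variable-length frame-level features into a fixed-length utterance-level feature. The sketch below shows one generic way an attention-weighted standard deviation pooling step can be written in NumPy. It is not the authors' implementation: the function name, the single-layer tanh attention, and the parameters w and v are assumptions made for illustration only, and the record does not specify the actual SSDP formulation.

```python
import numpy as np


def attentive_std_pooling(frames, w, v):
    """Pool (T, D) frame-level features into a fixed-length (D,) vector.

    frames : array of shape (T, D), T may differ between utterances
    w, v   : hypothetical attention parameters of shape (D, H) and (H,)
    """
    # One attention score per frame (assumed single-layer tanh scoring).
    scores = np.tanh(frames @ w) @ v              # shape (T,)
    scores = scores - scores.max()                # numerically stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum() # weights sum to 1 over time

    # Attention-weighted mean and standard deviation over the time axis.
    mean = alpha @ frames                         # shape (D,)
    var = alpha @ (frames ** 2) - mean ** 2       # weighted E[x^2] - E[x]^2
    return np.sqrt(np.clip(var, 1e-8, None))      # clip guards against tiny negatives


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w, v = rng.standard_normal((8, 4)), rng.standard_normal(4)
    for num_frames in (50, 120):                  # two utterance lengths
        pooled = attentive_std_pooling(rng.standard_normal((num_frames, 8)), w, v)
        print(num_frames, pooled.shape)
```

Whatever the number of frames, the output length equals the feature dimension D, which is the property the description attributes to the pooling stage.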
format Online
Article
Text
id pubmed-9920758
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-9920758 2023-02-12 ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition Deng, Fei Deng, Lihong Jiang, Peifan Zhang, Gexiang Yang, Qiang Sensors (Basel) Article MDPI 2023-01-20 /pmc/articles/PMC9920758/ /pubmed/36772243 http://dx.doi.org/10.3390/s23031203 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9920758/
https://www.ncbi.nlm.nih.gov/pubmed/36772243
http://dx.doi.org/10.3390/s23031203