
ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition

In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling the long-term contexts and efficiently aggregating the information are two challenges in speaker recognition, and they have a critical impact on system performance. Previous res...


Bibliographic Details
Main Authors:	Deng, Fei; Deng, Lihong; Jiang, Peifan; Zhang, Gexiang; Yang, Qiang
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9920758/ https://www.ncbi.nlm.nih.gov/pubmed/36772243 http://dx.doi.org/10.3390/s23031203

_version_	1784887148264030208
author	Deng, Fei; Deng, Lihong; Jiang, Peifan; Zhang, Gexiang; Yang, Qiang
author_sort	Deng, Fei
collection	PubMed
description	In speaker recognition tasks, convolutional neural network (CNN)-based approaches have shown significant success. Modeling long-term contexts and efficiently aggregating information are two challenges in speaker recognition, and both have a critical impact on system performance. Previous research has addressed these issues by introducing deeper, wider, and more complex network architectures and aggregation methods. However, it is difficult to significantly improve performance with these approaches because they also have trouble fully utilizing global information, channel information, and time-frequency information. To address these issues, we propose a lighter and more efficient CNN-based end-to-end speaker recognition architecture, ResSKNet-SSDP. ResSKNet-SSDP consists of a residual selective kernel network (ResSKNet) and self-attentive standard deviation pooling (SSDP). ResSKNet can capture long-term contexts, neighboring information, and global information, thus extracting more informative frame-level features. SSDP can capture short- and long-term changes in frame-level features, aggregating the variable-length frame-level features into fixed-length, more distinctive utterance-level features. In extensive comparison experiments against current state-of-the-art speaker recognition systems on two popular public speaker recognition datasets, VoxCeleb and CN-Celeb, ResSKNet-SSDP achieved the lowest EER/DCF of 2.33%/0.2298, 2.44%/0.2559, 4.10%/0.3502, and 12.28%/0.5051. Compared with the lightest x-vector, ResSKNet-SSDP has 3.1 M fewer parameters and 31.6 ms less inference time, yet performs 35.1% better. The results show that ResSKNet-SSDP significantly outperforms the current state-of-the-art speaker recognition architectures on all test sets and is an end-to-end architecture with fewer parameters and higher efficiency for applications in realistic situations. The ablation experiments further show that the proposed approaches also provide significant improvements over previous methods.
format	Online Article Text
id	pubmed-9920758
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-9920758 2023-02-12 ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition Deng, Fei; Deng, Lihong; Jiang, Peifan; Zhang, Gexiang; Yang, Qiang Sensors (Basel) Article MDPI 2023-01-20 /pmc/articles/PMC9920758/ /pubmed/36772243 http://dx.doi.org/10.3390/s23031203 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	ResSKNet-SSDP: Effective and Light End-To-End Architecture for Speaker Recognition
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9920758/ https://www.ncbi.nlm.nih.gov/pubmed/36772243 http://dx.doi.org/10.3390/s23031203
