Attention-Based Temporal-Frequency Aggregation for Speaker Verification

Convolutional neural networks (CNNs) have significantly advanced speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods pool the extracted features across time only and cannot capture the speaker-dependent information contained in the frequency domain. To address this problem, this paper proposes a novel attention-based frequency aggregation method that focuses on the key frequency bands contributing most to the utterance-level representation. In addition, two more effective temporal-frequency aggregation methods are proposed by combining the frequency aggregation with existing temporal aggregation methods. The two proposed methods capture the speaker-dependent information contained in both the time domain and the frequency domain of the frame-level features, thus improving the discriminability of the speaker embedding. Furthermore, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves an equal error rate of 5.96% on VoxCeleb, outperforming state-of-the-art baseline models.
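
To make the aggregation idea concrete, below is a minimal sketch of attention-based temporal-frequency pooling: a learned attention first weights frequency bands and collapses the frequency axis, then a second attention pools over time to produce the utterance-level embedding. All module names, layer choices, and dimensions here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class AttentiveAxisPool(nn.Module):
    """Attention pooling over one axis of a (batch, channels, freq, time) map.

    Hypothetical module for illustration; not from the paper's code.
    """
    def __init__(self, channels: int, axis: int):
        super().__init__()
        self.axis = axis  # 2 = frequency axis, 3 = time axis
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # scalar score per position

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the scores along the chosen axis, then take the
        # attention-weighted sum, collapsing that axis.
        w = torch.softmax(self.score(x), dim=self.axis)
        return (w * x).sum(dim=self.axis)

class TemporalFrequencyAggregator(nn.Module):
    """Pool frame-level CNN features over frequency first, then over time."""
    def __init__(self, channels: int, embed_dim: int = 256):
        super().__init__()
        self.freq_pool = AttentiveAxisPool(channels, axis=2)
        self.time_score = nn.Conv1d(channels, 1, kernel_size=1)
        self.proj = nn.Linear(channels, embed_dim)  # utterance-level embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) from the CNN frontend
        h = self.freq_pool(x)                          # (batch, channels, time)
        w = torch.softmax(self.time_score(h), dim=2)   # attention over time
        u = (w * h).sum(dim=2)                         # (batch, channels)
        return self.proj(u)                            # speaker embedding

if __name__ == "__main__":
    feats = torch.randn(4, 512, 10, 300)  # dummy frame-level feature maps
    print(TemporalFrequencyAggregator(512)(feats).shape)  # torch.Size([4, 256])

In this sketch both attentions use a scalar score per position; the paper instead combines its frequency attention with existing temporal aggregation methods, so the simple time attention above merely stands in for whichever temporal pooling is chosen.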

Bibliographic Details
Main Authors: Wang, Meng; Feng, Dazheng; Su, Tingting; Chen, Mohan
Format: Online Article Text
Language: English
Published: MDPI, 2022-03-10
Journal: Sensors (Basel)
Subjects: Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8953125/
https://www.ncbi.nlm.nih.gov/pubmed/35336315
http://dx.doi.org/10.3390/s22062147
Collection: PubMed (National Center for Biotechnology Information)
Record Format: MEDLINE/PubMed
Record ID: pubmed-8953125
License: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).