Attention-Based Temporal-Frequency Aggregation for Speaker Verification
Convolutional neural networks (CNNs) have significantly promoted the development of speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation…
Main Authors: | Wang, Meng; Feng, Dazheng; Su, Tingting; Chen, Mohan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI, 2022 |
Subjects: | Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8953125/ https://www.ncbi.nlm.nih.gov/pubmed/35336315 http://dx.doi.org/10.3390/s22062147 |
author | Wang, Meng Feng, Dazheng Su, Tingting Chen, Mohan |
collection | PubMed |
description | Convolutional neural networks (CNNs) have significantly promoted the development of speaker verification (SV) systems because of their powerful deep feature learning capability. In CNN-based SV systems, utterance-level aggregation is an important component: it compresses the frame-level features generated by the CNN frontend into an utterance-level representation. However, most existing aggregation methods aggregate the extracted features across time and cannot capture the speaker-dependent information contained in the frequency domain. To handle this problem, this paper proposes a novel attention-based frequency aggregation method, which focuses on the key frequency bands that provide more information for the utterance-level representation. Meanwhile, two more effective temporal-frequency aggregation methods are proposed in combination with the existing temporal aggregation methods. The two proposed methods can capture the speaker-dependent information contained in both the time domain and the frequency domain of frame-level features, thus improving the discriminability of speaker embeddings. In addition, a powerful CNN-based SV system is developed and evaluated on the TIMIT and VoxCeleb datasets. The experimental results indicate that the CNN-based SV system using the temporal-frequency aggregation method achieves a superior equal error rate of 5.96% on VoxCeleb compared with the state-of-the-art baseline models. |
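The method summarized above pools frame-level CNN features over both the time and frequency axes using learned attention weights. As a rough sketch only, and not the authors' exact architecture (the module name, layer sizes, and the sequential time-then-frequency ordering here are all illustrative assumptions), attention-based temporal-frequency aggregation could be written in PyTorch along these lines:

```python
# Hypothetical sketch of attention-based temporal-frequency aggregation,
# assuming frame-level CNN features of shape (batch, channels, freq, time).
# Not the paper's exact formulation; layer sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalFrequencyAttentionPool(nn.Module):
    def __init__(self, channels: int, freq_bins: int, attn_dim: int = 128):
        super().__init__()
        # Scores one attention weight per time frame...
        self.time_attn = nn.Sequential(
            nn.Linear(channels * freq_bins, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        # ...and one per frequency band of the time-pooled map.
        self.freq_attn = nn.Sequential(
            nn.Linear(channels, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        # Temporal attention: weight each frame by its learned relevance.
        frames = x.permute(0, 3, 1, 2).reshape(b, t, c * f)    # (b, t, c*f)
        w_t = F.softmax(self.time_attn(frames), dim=1)         # (b, t, 1)
        pooled_t = (frames * w_t).sum(dim=1).reshape(b, c, f)  # (b, c, f)
        # Frequency attention: emphasize the key frequency bands.
        bands = pooled_t.permute(0, 2, 1)                      # (b, f, c)
        w_f = F.softmax(self.freq_attn(bands), dim=1)          # (b, f, 1)
        return (bands * w_f).sum(dim=1)                        # (b, c)
```

Feeding the resulting (batch, channels) vector through a projection layer would yield the fixed-length speaker embedding whose discriminability the abstract discusses.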
format | Online Article Text |
id | pubmed-8953125 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-8953125 2022-03-26 Attention-Based Temporal-Frequency Aggregation for Speaker Verification Wang, Meng; Feng, Dazheng; Su, Tingting; Chen, Mohan Sensors (Basel) Article MDPI 2022-03-10 /pmc/articles/PMC8953125/ /pubmed/35336315 http://dx.doi.org/10.3390/s22062147 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
title | Attention-Based Temporal-Frequency Aggregation for Speaker Verification |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8953125/ https://www.ncbi.nlm.nih.gov/pubmed/35336315 http://dx.doi.org/10.3390/s22062147 |