
Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention

Bibliographic Details
Main Authors: Jin, Ye; Wang, Mei; Luo, Liyan; Zhao, Dinghao; Liu, Zhanqi
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9503981/
https://www.ncbi.nlm.nih.gov/pubmed/36146166
http://dx.doi.org/10.3390/s22186818
_version_ 1784796101192187904
author Jin, Ye
Wang, Mei
Luo, Liyan
Zhao, Dinghao
Liu, Zhanqi
author_facet Jin, Ye
Wang, Mei
Luo, Liyan
Zhao, Dinghao
Liu, Zhanqi
author_sort Jin, Ye
collection PubMed
description The complexity of polyphonic sounds imposes numerous challenges on their classification. Especially in real life, polyphonic sound events exhibit discontinuities and unstable time-frequency variations. A single traditional acoustic feature cannot characterize the key information of a polyphonic sound event, and this deficiency results in poor classification performance. In this paper, we propose a convolutional recurrent neural network model based on the temporal-frequency (TF) attention mechanism and the feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model takes aggregated Log-Mel spectrogram and MFCC features as input and comprises the TF-attention module, the convolutional recurrent neural network (CRNN) module, the FS-attention module, and the bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures the critical temporal-frequency features more effectively, while the FS-attention module assigns dynamically learnable weights to the different feature dimensions. The TFFS-CRNN model thereby better characterizes the key feature information in polyphonic SED: with the two attention modules, the model can focus on semantically relevant time frames, key frequency bands, and important feature spaces. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task 3 and DCASE 2017 Task 3 datasets. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2% over the winning systems of the respective DCASE challenges, while the error rate (ER) was reduced by 0.41 and 0.37. The proposed TFFS-CRNN model therefore achieves better classification performance and a lower ER in polyphonic SED.
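
The description above gives the pipeline only in prose (TF-attention on the input features, a CNN front end, FS-attention on the resulting feature space, then a BGRU for context). Below is a minimal PyTorch sketch of such a pipeline, assuming details the abstract does not specify: all module internals, layer sizes, the class count, the input feature size (e.g. 64 Log-Mel bands + 40 MFCCs concatenated per frame), and the names TFAttention, FSAttention, and TFFSCRNN are illustrative guesses, not the authors' published implementation.

```python
# Hypothetical sketch of a TFFS-CRNN-style model for polyphonic SED.
# Layer sizes and attention internals are assumptions, not the paper's code.
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    """Temporal-frequency attention: re-weights each time-frequency bin of
    the input so salient frames and bands dominate the later stages."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (batch, 1, time, freq)
        weights = torch.sigmoid(self.conv(x))  # per-bin weights in [0, 1]
        return x * weights

class FSAttention(nn.Module):
    """Feature-space attention: dynamically learnable weights over the
    feature dimensions of the CRNN output."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, time, dim)
        weights = torch.softmax(self.fc(x), dim=-1)
        return x * weights

class TFFSCRNN(nn.Module):
    def __init__(self, n_freq=104, n_classes=6, hidden=128):
        super().__init__()
        self.tf_att = TFAttention()
        # CNN front end pooled along frequency only, preserving the time
        # resolution needed for frame-level event detection.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        feat_dim = 64 * (n_freq // 16)         # channels x pooled freq bins
        self.fs_att = FSAttention(feat_dim)
        self.bgru = nn.GRU(feat_dim, hidden, batch_first=True,
                           bidirectional=True)  # contextual modeling
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, 1, time, freq)
        x = self.tf_att(x)
        x = self.cnn(x)                        # (batch, 64, time, freq/16)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = self.fs_att(x)
        x, _ = self.bgru(x)
        return torch.sigmoid(self.head(x))     # frame-wise multi-label scores

# Assumed input: 64 Log-Mel bands + 40 MFCCs concatenated -> 104 bins/frame.
model = TFFSCRNN(n_freq=104, n_classes=6)
scores = model(torch.randn(8, 1, 256, 104))    # -> (8, 256, 6)
```

Pooling only along frequency is a common CRNN design choice for SED, since the per-frame sigmoid outputs must stay aligned with the annotation grid; the paper may differ in these particulars.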
format Online
Article
Text
id pubmed-9503981
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-95039812022-09-24 Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention Jin, Ye Wang, Mei Luo, Liyan Zhao, Dinghao Liu, Zhanqi Sensors (Basel) Article The complexity of polyphonic sounds imposes numerous challenges on their classification. Especially in real life, polyphonic sound events exhibit discontinuities and unstable time-frequency variations. A single traditional acoustic feature cannot characterize the key information of a polyphonic sound event, and this deficiency results in poor classification performance. In this paper, we propose a convolutional recurrent neural network model based on the temporal-frequency (TF) attention mechanism and the feature space (FS) attention mechanism (TFFS-CRNN). The TFFS-CRNN model takes aggregated Log-Mel spectrogram and MFCC features as input and comprises the TF-attention module, the convolutional recurrent neural network (CRNN) module, the FS-attention module, and the bidirectional gated recurrent unit (BGRU) module. In polyphonic sound event detection (SED), the TF-attention module captures the critical temporal-frequency features more effectively, while the FS-attention module assigns dynamically learnable weights to the different feature dimensions. The TFFS-CRNN model thereby better characterizes the key feature information in polyphonic SED: with the two attention modules, the model can focus on semantically relevant time frames, key frequency bands, and important feature spaces. Finally, the BGRU module learns contextual information. Experiments were conducted on the DCASE 2016 Task 3 and DCASE 2017 Task 3 datasets. The results show that the F1-score of the TFFS-CRNN model improved by 12.4% and 25.2% over the winning systems of the respective DCASE challenges, while the error rate (ER) was reduced by 0.41 and 0.37. The proposed TFFS-CRNN model therefore achieves better classification performance and a lower ER in polyphonic SED. MDPI 2022-09-09 /pmc/articles/PMC9503981/ /pubmed/36146166 http://dx.doi.org/10.3390/s22186818 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
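
The F1 and ER figures cited in the record are the segment-based metrics standard in the DCASE challenges (computed in practice with the sed_eval toolkit). As a reference point, the sketch below shows how segment-based ER and F1 are defined from per-segment substitutions, deletions, and insertions; the function name and array shapes are illustrative assumptions.

```python
# Hypothetical sketch of segment-based ER and F1 for polyphonic SED,
# following the DCASE definition (Mesaros et al., 2016). Not the official
# sed_eval implementation.
import numpy as np

def segment_metrics(ref, est):
    """ref, est: binary activity matrices of shape (n_segments, n_classes)."""
    tp = np.logical_and(ref, est).sum(axis=1)                  # hits/segment
    fn = np.logical_and(ref, np.logical_not(est)).sum(axis=1)  # misses
    fp = np.logical_and(np.logical_not(ref), est).sum(axis=1)  # false alarms
    S = np.minimum(fn, fp).sum()        # substitutions: paired miss + alarm
    D = np.maximum(0, fn - fp).sum()    # deletions: unpaired misses
    I = np.maximum(0, fp - fn).sum()    # insertions: unpaired alarms
    N = max(ref.sum(), 1)               # active reference events (guard /0)
    er = (S + D + I) / N
    f1 = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    return er, f1
```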
spellingShingle Article
Jin, Ye
Wang, Mei
Luo, Liyan
Zhao, Dinghao
Liu, Zhanqi
Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
title Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
title_full Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
title_fullStr Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
title_full_unstemmed Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
title_short Polyphonic Sound Event Detection Using Temporal-Frequency Attention and Feature Space Attention
title_sort polyphonic sound event detection using temporal-frequency attention and feature space attention
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9503981/
https://www.ncbi.nlm.nih.gov/pubmed/36146166
http://dx.doi.org/10.3390/s22186818
work_keys_str_mv AT jinye polyphonicsoundeventdetectionusingtemporalfrequencyattentionandfeaturespaceattention
AT wangmei polyphonicsoundeventdetectionusingtemporalfrequencyattentionandfeaturespaceattention
AT luoliyan polyphonicsoundeventdetectionusingtemporalfrequencyattentionandfeaturespaceattention
AT zhaodinghao polyphonicsoundeventdetectionusingtemporalfrequencyattentionandfeaturespaceattention
AT liuzhanqi polyphonicsoundeventdetectionusingtemporalfrequencyattentionandfeaturespaceattention