Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models
Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal....
Main authors: | Ali, Mohamed Nabih; Falavigna, Daniele; Brutti, Alessio |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8749591/ https://www.ncbi.nlm.nih.gov/pubmed/35009917 http://dx.doi.org/10.3390/s22010374 |
_version_ | 1784631266844344320 |
---|---|
author | Ali, Mohamed Nabih Falavigna, Daniele Brutti, Alessio |
author_facet | Ali, Mohamed Nabih Falavigna, Daniele Brutti, Alessio |
author_sort | Ali, Mohamed Nabih |
collection | PubMed |
description | Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal. However, although the enhancement front-end typically increases the speech quality from an intelligibility perspective, it tends to introduce distortions that degrade the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end, which optimize a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Unlike typical state-of-the-art approaches that rely on spectral features or neural embeddings, we operate in the time domain, processing raw waveforms in both components. As an application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net, while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, shedding light on the most promising training approaches. |
format | Online Article Text |
id | pubmed-8749591 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-8749591 2022-01-12 Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models Ali, Mohamed Nabih Falavigna, Daniele Brutti, Alessio Sensors (Basel) Article Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal. However, although the enhancement front-end typically increases the speech quality from an intelligibility perspective, it tends to introduce distortions that degrade the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end, which optimize a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Unlike typical state-of-the-art approaches that rely on spectral features or neural embeddings, we operate in the time domain, processing raw waveforms in both components. As an application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net, while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, shedding light on the most promising training approaches. MDPI 2022-01-04 /pmc/articles/PMC8749591/ /pubmed/35009917 http://dx.doi.org/10.3390/s22010374 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Ali, Mohamed Nabih Falavigna, Daniele Brutti, Alessio Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_full | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_fullStr | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_full_unstemmed | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_short | Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models |
title_sort | time-domain joint training strategies of speech enhancement and intent classification neural models |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8749591/ https://www.ncbi.nlm.nih.gov/pubmed/35009917 http://dx.doi.org/10.3390/s22010374 |
work_keys_str_mv | AT alimohamednabih timedomainjointtrainingstrategiesofspeechenhancementandintentclassificationneuralmodels AT falavignadaniele timedomainjointtrainingstrategiesofspeechenhancementandintentclassificationneuralmodels AT bruttialessio timedomainjointtrainingstrategiesofspeechenhancementandintentclassificationneuralmodels |
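
The description in this record says that the enhancement front-end and the intent classifier are trained jointly by optimizing a combined loss function, but the record does not give its form. As a rough illustration only, the sketch below assumes a convex combination of a waveform mean squared error (enhancement term) and a softmax cross-entropy (classification term). The weight `alpha`, the choice of both loss terms, and all function names are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def mse_loss(enhanced: Sequence[float], clean: Sequence[float]) -> float:
    """Mean squared error between two equal-length time-domain waveforms."""
    if len(enhanced) != len(clean):
        raise ValueError("waveforms must have the same length")
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)


def cross_entropy_loss(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of softmax(logits) against one integer class label."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def combined_loss(
    enhanced: Sequence[float],
    clean: Sequence[float],
    logits: Sequence[float],
    target: int,
    alpha: float = 0.5,  # hypothetical weight; the record does not specify one
) -> float:
    """Assumed form: alpha * enhancement loss + (1 - alpha) * intent loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    enhancement_term = mse_loss(enhanced, clean)
    intent_term = cross_entropy_loss(logits, target)
    return alpha * enhancement_term + (1.0 - alpha) * intent_term


if __name__ == "__main__":
    # Toy values: a 4-sample waveform pair and logits over 3 intent classes.
    clean_wave = [0.0, 0.5, -0.5, 0.25]
    enhanced_wave = [0.1, 0.4, -0.5, 0.25]
    intent_logits = [2.0, 0.1, -1.0]
    print(combined_loss(enhanced_wave, clean_wave, intent_logits, target=0))
```

In a real joint-training setup, the gradient of the classification term would flow back through the classifier into the enhancement network, which is how the back-end can guide the front-end; this sketch only computes the scalar loss value.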