
Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models

Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal....


Bibliographic Details
Main Authors: Ali, Mohamed Nabih, Falavigna, Daniele, Brutti, Alessio
Format: Online Article Text
Language: English
Published: MDPI 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8749591/
https://www.ncbi.nlm.nih.gov/pubmed/35009917
http://dx.doi.org/10.3390/s22010374
_version_ 1784631266844344320
author Ali, Mohamed Nabih
Falavigna, Daniele
Brutti, Alessio
collection PubMed
description Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal. However, although the enhancement front-end typically improves speech quality from an intelligibility perspective, it tends to introduce distortions that degrade the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end by optimizing a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Unlike typical state-of-the-art approaches that rely on spectral features or neural embeddings, we operate in the time domain, processing raw waveforms in both components. As an application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net, while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, providing insight into the most promising training approaches.
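The combined loss mentioned in the description can be illustrated with a minimal sketch. This record does not give the loss terms or their weighting, so everything below is an assumption made for illustration: mean squared error on the raw waveform as the enhancement term, cross-entropy as the intent classification term, a convex weight `alpha`, and the hypothetical function names `mse`, `cross_entropy`, and `combined_loss`. It uses plain Python lists rather than a deep learning framework and should not be read as the authors' implementation.

```python
import math


def mse(enhanced, clean):
    """Mean squared error between two equal-length waveforms (assumed enhancement loss)."""
    if len(enhanced) != len(clean):
        raise ValueError("waveforms must have the same length")
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)


def cross_entropy(logits, target):
    """Cross-entropy for one utterance, computed from raw class scores."""
    # Subtract the maximum score before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def combined_loss(enhanced, clean, logits, target, alpha=0.5):
    """Weighted sum of the two terms; alpha is a hypothetical trade-off weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    enhancement_term = mse(enhanced, clean)
    classification_term = cross_entropy(logits, target)
    return alpha * enhancement_term + (1.0 - alpha) * classification_term


if __name__ == "__main__":
    # Toy values: a 2-sample "waveform" and 2 intent classes.
    print(combined_loss([1.0, 2.0], [1.0, 0.0], [0.0, 0.0], target=0, alpha=0.5))
```

In a joint training setup of this kind, the gradient of the classification term flows back through the enhancement model as well, which is how the back-end can guide the front-end. With `alpha=1.0` the sketch reduces to enhancement-only training, and with `alpha=0.0` to classification-only training.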
format Online
Article
Text
id pubmed-8749591
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8749591 2022-01-12 Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models Ali, Mohamed Nabih Falavigna, Daniele Brutti, Alessio Sensors (Basel) Article MDPI 2022-01-04 /pmc/articles/PMC8749591/ /pubmed/35009917 http://dx.doi.org/10.3390/s22010374 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8749591/
https://www.ncbi.nlm.nih.gov/pubmed/35009917
http://dx.doi.org/10.3390/s22010374