
Non-intrusive deep learning-based computational speech metrics with high-accuracy across a wide range of acoustic scenes


Bibliographic Details

Main Authors: Diehl, Peter Udo, Thorbergsson, Leifur, Singer, Yosef, Skripniuk, Vladislav, Pudszuhn, Annett, Hofmann, Veit M., Sprengel, Elias, Meyer-Rachner, Paul
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9704549/
https://www.ncbi.nlm.nih.gov/pubmed/36441711
http://dx.doi.org/10.1371/journal.pone.0278170
collection PubMed
description Speech with high sound quality and little noise is central to many of our communication tools, including calls, video conferencing and hearing aids. While human ratings provide the best measure of sound quality, they are costly and time-intensive to gather, so computational metrics are typically used instead. Here we present a non-intrusive, deep learning-based metric that takes only a sound sample as input and returns ratings in three categories: overall quality, noise, and sound quality. The metric is available via a web API and consists of an ensemble of five deep neural networks that use either ResNet-26 architectures with STFT inputs or fully-connected networks with wav2vec features as inputs. The networks are trained and tested on over 1 million crowd-sourced human sound ratings across the three categories. Correlations of our metric with human ratings match or exceed other state-of-the-art metrics on 51 out of 56 benchmark scenes, and, unlike the metrics that perform best on the remaining 5 scenes, it does not require a clean speech reference sample. The benchmark scenes represent a wide variety of acoustic environments and a large selection of post-processing methods, including classical methods (e.g. Wiener filtering) and newer deep-learning methods.
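The ensemble structure described above (five networks, each mapping a sound sample to ratings in three categories, with their predictions averaged) can be sketched as follows. This is a minimal illustration of the averaging idea only: the function and category names are hypothetical stand-ins, and the placeholder member returns seeded values in place of running an actual ResNet-26 on an STFT or a fully-connected network on wav2vec features.

```python
import random
from statistics import mean

CATEGORIES = ("overall_quality", "noise", "sound_quality")

def predict_member(samples, seed):
    # Stand-in for one ensemble member (in the paper, a ResNet-26 on an
    # STFT input or a fully-connected net on wav2vec features). A real
    # member would run a trained network; this placeholder just returns
    # deterministic ratings in the range [1, 5).
    rng = random.Random(seed)
    return [1.0 + 4.0 * rng.random() for _ in CATEGORIES]

def ensemble_ratings(samples):
    # Collect predictions from the 5 members and average per category.
    members = [predict_member(samples, seed=s) for s in range(5)]
    return {cat: mean(m[i] for m in members)
            for i, cat in enumerate(CATEGORIES)}

# One second of "silence" at 16 kHz as a dummy input sample.
ratings = ensemble_ratings([0.0] * 16000)
```

Averaging independent members is the standard way such ensembles trade a small amount of compute for lower prediction variance; the actual service exposes this behind a web API rather than a local function call.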
id pubmed-9704549
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
PLoS One (Public Library of Science), Research Article, published 2022-11-28. /pmc/articles/PMC9704549/ /pubmed/36441711 http://dx.doi.org/10.1371/journal.pone.0278170
© 2022 Diehl et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article