Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face
Main Authors: Shan, Tong; Wenner, Casper E.; Xu, Chenliang; Duan, Zhiyao; Maddox, Ross K.
Format: Online Article Text
Language: English
Published: SAGE Publications, 2022
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9677167/ https://www.ncbi.nlm.nih.gov/pubmed/36384325 http://dx.doi.org/10.1177/23312165221136934
_version_ | 1784833753753845760 |
author | Shan, Tong; Wenner, Casper E.; Xu, Chenliang; Duan, Zhiyao; Maddox, Ross K.
author_facet | Shan, Tong; Wenner, Casper E.; Xu, Chenliang; Duan, Zhiyao; Maddox, Ross K.
author_sort | Shan, Tong |
collection | PubMed |
description | Listening in a noisy environment is challenging, but many previous studies have demonstrated that comprehension of speech can be substantially improved by looking at the talker's face. We recently developed a deep neural network (DNN)-based system that generates movies of a talking face from speech audio and a single face image. In this study, we aimed to quantify the benefits that such a system can bring to speech comprehension, especially in noise. The target speech audio was masked at signal-to-noise ratios (SNRs) of −9, −6, −3, and 0 dB and was presented to subjects in three audio-visual (AV) stimulus conditions: (1) synthesized AV: audio with the synthesized talking face movie; (2) natural AV: audio with the original movie from the corpus; and (3) audio-only: audio with a static image of the talker. Subjects were asked to type the sentences they heard in each trial, and keyword recognition was quantified for each condition. Overall, performance in the synthesized AV condition fell approximately halfway between the other two conditions, showing a marked improvement over the audio-only control but still falling short of the natural AV condition. Every subject showed some benefit from the synthetic AV stimulus. The results of this study support the idea that a DNN-based model that generates a talking face from speech audio can meaningfully enhance comprehension in noisy environments, and has the potential to be used as a visual hearing aid. |
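The masking procedure described in the abstract (target speech mixed with a masker at fixed SNRs of −9 to 0 dB) can be sketched as below. This is a minimal illustration, not the study's actual stimulus code: the function name `mix_at_snr` and the white-noise masker are assumptions for the example.

```python
import numpy as np

def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so the target-to-noise power ratio equals `snr_db`,
    then return the sum. Assumes equal-length 1-D float arrays."""
    target = np.asarray(target, dtype=float)
    noise = np.asarray(noise, dtype=float)
    p_target = np.mean(target ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the masker to the power required for the desired SNR.
    gain = np.sqrt(p_target / (p_noise * 10.0 ** (snr_db / 10.0)))
    return target + gain * noise

# Example: white-noise masker at the study's hardest condition (-9 dB SNR).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of speech at 16 kHz
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=-9)
```

Keyword-recognition scoring would then compare each typed response against the keywords of the presented sentence, per condition and SNR.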
format | Online Article Text |
id | pubmed-9677167 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | SAGE Publications |
record_format | MEDLINE/PubMed |
spelling | pubmed-9677167 2022-11-22. Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face. Shan, Tong; Wenner, Casper E.; Xu, Chenliang; Duan, Zhiyao; Maddox, Ross K. Trends Hear, Original Article. (Abstract identical to the description field above.) |
SAGE Publications 2022-11-16 /pmc/articles/PMC9677167/ /pubmed/36384325 http://dx.doi.org/10.1177/23312165221136934 © The Author(s) 2022. This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage). |
spellingShingle | Original Article; Shan, Tong; Wenner, Casper E.; Xu, Chenliang; Duan, Zhiyao; Maddox, Ross K.; Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face |
title | Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face |
title_full | Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face |
title_fullStr | Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face |
title_full_unstemmed | Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face |
title_short | Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face |
title_sort | speech-in-noise comprehension is improved when viewing a deep-neural-network-generated talking face |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9677167/ https://www.ncbi.nlm.nih.gov/pubmed/36384325 http://dx.doi.org/10.1177/23312165221136934 |
work_keys_str_mv | AT shantong speechinnoisecomprehensionisimprovedwhenviewingadeepneuralnetworkgeneratedtalkingface AT wennercaspere speechinnoisecomprehensionisimprovedwhenviewingadeepneuralnetworkgeneratedtalkingface AT xuchenliang speechinnoisecomprehensionisimprovedwhenviewingadeepneuralnetworkgeneratedtalkingface AT duanzhiyao speechinnoisecomprehensionisimprovedwhenviewingadeepneuralnetworkgeneratedtalkingface AT maddoxrossk speechinnoisecomprehensionisimprovedwhenviewingadeepneuralnetworkgeneratedtalkingface |