Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans

Bibliographic Details
Main Authors: Varano, Enrico, Vougioukas, Konstantinos, Ma, Pingchuan, Petridis, Stavros, Pantic, Maja, Reichenbach, Tobias
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Neuroscience
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8766421/
https://www.ncbi.nlm.nih.gov/pubmed/35069100
http://dx.doi.org/10.3389/fnins.2021.781196
_version_ 1784634526809456640
author Varano, Enrico
Vougioukas, Konstantinos
Ma, Pingchuan
Petridis, Stavros
Pantic, Maja
Reichenbach, Tobias
author_facet Varano, Enrico
Vougioukas, Konstantinos
Ma, Pingchuan
Petridis, Stavros
Pantic, Maja
Reichenbach, Tobias
author_sort Varano, Enrico
collection PubMed
description Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield an even higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.
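
The description above refers to end-to-end synthesis of a photorealistic talking face from a speech recording and a single still image. As a rough illustration of this class of model only (this is not the authors' network; every module, dimension, and name below is a hypothetical sketch), a speech-conditioned GAN generator can be outlined in PyTorch:

import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):
    """Encodes a still face image (3x64x64) into an identity embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, dim),
        )
    def forward(self, img):
        return self.net(img)

class AudioEncoder(nn.Module):
    """Summarizes a window of audio features (e.g., mel-spectrogram frames)."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)
    def forward(self, mels):           # mels: (batch, time, n_mels)
        _, h = self.rnn(mels)
        return h.squeeze(0)            # (batch, dim)

class FrameGenerator(nn.Module):
    """Decodes identity + audio embeddings into one synthesized video frame."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),    # 32 -> 64
        )
    def forward(self, id_emb, audio_emb):
        z = self.fc(torch.cat([id_emb, audio_emb], dim=1))
        return self.net(z.view(-1, 128, 8, 8))  # (batch, 3, 64, 64)

# Usage: one frame per audio window; iterating over windows yields a video.
still = torch.randn(1, 3, 64, 64)   # the speaker's still image
mels = torch.randn(1, 20, 80)       # 20 frames of mel features
frame = FrameGenerator()(IdentityEncoder()(still), AudioEncoder()(mels))
print(frame.shape)                  # torch.Size([1, 3, 64, 64])

In a full adversarial setup, a discriminator trained to tell natural from synthesized frames (and, commonly, a sequence-level discriminator for temporal coherence and lip sync) would drive such a generator toward photorealism; those components are omitted from this sketch.
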
format Online
Article
Text
id pubmed-8766421
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-8766421 2022-01-20 Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans Varano, Enrico Vougioukas, Konstantinos Ma, Pingchuan Petridis, Stavros Pantic, Maja Reichenbach, Tobias Front Neurosci Neuroscience Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield an even higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments. Frontiers Media S.A. 2022-01-05 /pmc/articles/PMC8766421/ /pubmed/35069100 http://dx.doi.org/10.3389/fnins.2021.781196 Text en Copyright © 2022 Varano, Vougioukas, Ma, Petridis, Pantic and Reichenbach. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
spellingShingle Neuroscience
Varano, Enrico
Vougioukas, Konstantinos
Ma, Pingchuan
Petridis, Stavros
Pantic, Maja
Reichenbach, Tobias
Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans
title Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans
title_full Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans
title_fullStr Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans
title_full_unstemmed Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans
title_short Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans
title_sort speech-driven facial animations improve speech-in-noise comprehension of humans
topic Neuroscience
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8766421/
https://www.ncbi.nlm.nih.gov/pubmed/35069100
http://dx.doi.org/10.3389/fnins.2021.781196
work_keys_str_mv AT varanoenrico speechdrivenfacialanimationsimprovespeechinnoisecomprehensionofhumans
AT vougioukaskonstantinos speechdrivenfacialanimationsimprovespeechinnoisecomprehensionofhumans
AT mapingchuan speechdrivenfacialanimationsimprovespeechinnoisecomprehensionofhumans
AT petridisstavros speechdrivenfacialanimationsimprovespeechinnoisecomprehensionofhumans
AT panticmaja speechdrivenfacialanimationsimprovespeechinnoisecomprehensionofhumans
AT reichenbachtobias speechdrivenfacialanimationsimprovespeechinnoisecomprehensionofhumans