A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

The cocktail party problem can be more effectively addressed by leveraging the speaker’s visual and audio information. This paper proposes a method to improve the audio’s separation using two visual cues: facial features and lip movement. Firstly, residual connections are introduced in the audio sep...

Bibliographic Details
Main authors: Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing
Format: Online Article Text
Language: English
Published: MDPI 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647675/
https://www.ncbi.nlm.nih.gov/pubmed/37960477
http://dx.doi.org/10.3390/s23218770
author Li, Guizhu
Fu, Min
Sun, Mengnan
Liu, Xuefeng
Zheng, Bing
collection PubMed
description The cocktail party problem can be addressed more effectively by leveraging both the visual and the audio information of the speaker. This paper proposes a method that improves audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information other than the face, and that information has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Third, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual streams. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably a 4 dB improvement in SDR.
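For readers unfamiliar with the SDR figure quoted in the abstract, the following is a minimal sketch of a simplified signal-to-distortion ratio in decibels, 10·log10(‖s‖² / ‖s − ŝ‖²). It is an illustration only: the record does not say which SDR variant the authors used (BSS-Eval-style SDR, for example, involves additional projection steps), and the function name `sdr_db` and the toy signals are assumptions introduced here.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    # Energy of the clean reference signal.
    signal_energy = sum(s * s for s in reference)
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    # Energy of the residual between reference and separated estimate.
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy example: the estimate is the reference scaled by 0.9.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 2))  # roughly 20 dB
```

Under this simplified definition, a 4 dB gain corresponds to the residual energy shrinking by a factor of about 2.5 relative to the signal energy.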
format Online
Article
Text
id pubmed-10647675
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-10647675 2023-10-27 A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model Li, Guizhu Fu, Min Sun, Mengnan Liu, Xuefeng Zheng, Bing Sensors (Basel) Article MDPI 2023-10-27 /pmc/articles/PMC10647675/ /pubmed/37960477 http://dx.doi.org/10.3390/s23218770 Text en © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647675/
https://www.ncbi.nlm.nih.gov/pubmed/37960477
http://dx.doi.org/10.3390/s23218770