
A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model

The cocktail party problem can be more effectively addressed by leveraging the speaker’s visual and audio information. This paper proposes a method to improve the audio’s separation using two visual cues: facial features and lip movement. Firstly, residual connections are introduced in the audio sep...

Bibliographic Details
Main Authors:	Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing
Format:	Online Article Text
Language:	English
Published:	MDPI 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647675/ https://www.ncbi.nlm.nih.gov/pubmed/37960477 http://dx.doi.org/10.3390/s23218770

author	Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing
author_sort	Li, Guizhu
collection	PubMed
description	The cocktail party problem can be addressed more effectively by leveraging both the visual and the audio information of the speaker. This paper proposes a method that improves audio separation using two visual cues: facial features and lip movement. First, residual connections are introduced in the audio separation module to extract detailed features. Second, because the video stream contains information beyond the face that correlates only minimally with the audio, an attention mechanism is employed in the face module to focus on the crucial information. Finally, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual streams. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably a 4 dB improvement in SDR.
format	Online Article Text
id	pubmed-10647675
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	Sensors (Basel). Published 2023-10-27 by MDPI. © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647675/ https://www.ncbi.nlm.nih.gov/pubmed/37960477 http://dx.doi.org/10.3390/s23218770
