A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model
The cocktail party problem can be more effectively addressed by leveraging the speaker’s visual and audio information. This paper proposes a method to improve the audio’s separation using two visual cues: facial features and lip movement. Firstly, residual connections are introduced in the audio sep...
Main authors: | Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647675/ https://www.ncbi.nlm.nih.gov/pubmed/37960477 http://dx.doi.org/10.3390/s23218770 |
_version_ | 1785135162799947776 |
---|---|
author | Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing |
author_facet | Li, Guizhu; Fu, Min; Sun, Mengnan; Liu, Xuefeng; Zheng, Bing |
author_sort | Li, Guizhu |
collection | PubMed |
description | The cocktail party problem can be more effectively addressed by leveraging the speaker’s visual and audio information. This paper proposes a method to improve the audio’s separation using two visual cues: facial features and lip movement. Firstly, residual connections are introduced in the audio separation module to extract detailed features. Secondly, because the video stream contains information other than the face that has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on crucial information. Thirdly, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual streams. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably a 4 dB improvement in SDR. |
format | Online Article Text |
id | pubmed-10647675 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-10647675 2023-10-27 A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model Li, Guizhu Fu, Min Sun, Mengnan Liu, Xuefeng Zheng, Bing Sensors (Basel) Article The cocktail party problem can be more effectively addressed by leveraging the speaker’s visual and audio information. This paper proposes a method to improve the audio’s separation using two visual cues: facial features and lip movement. Firstly, residual connections are introduced in the audio separation module to extract detailed features. Secondly, because the video stream contains information other than the face that has minimal correlation with the audio, an attention mechanism is employed in the face module to focus on crucial information. Thirdly, the loss function incorporates audio-visual similarity to fully exploit the relationship between the audio and visual streams. Experimental results on the public VoxCeleb2 dataset show that the proposed model significantly improves SDR, PESQ, and STOI, most notably a 4 dB improvement in SDR. MDPI 2023-10-27 /pmc/articles/PMC10647675/ /pubmed/37960477 http://dx.doi.org/10.3390/s23218770 Text en © 2023 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Li, Guizhu Fu, Min Sun, Mengnan Liu, Xuefeng Zheng, Bing A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model |
title | A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model |
title_full | A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model |
title_fullStr | A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model |
title_full_unstemmed | A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model |
title_short | A Facial Feature and Lip Movement Enhanced Audio-Visual Speech Separation Model |
title_sort | facial feature and lip movement enhanced audio-visual speech separation model |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10647675/ https://www.ncbi.nlm.nih.gov/pubmed/37960477 http://dx.doi.org/10.3390/s23218770 |
work_keys_str_mv | AT liguizhu afacialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT fumin afacialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT sunmengnan afacialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT liuxuefeng afacialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT zhengbing afacialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT liguizhu facialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT fumin facialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT sunmengnan facialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT liuxuefeng facialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel AT zhengbing facialfeatureandlipmovementenhancedaudiovisualspeechseparationmodel |
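
The description above reports its headline result as a gain in SDR (signal-to-distortion ratio) measured in dB. As a rough illustration of what a dB gain in SDR means, the sketch below computes the simplest energy-ratio form, 10·log10(‖s‖² / ‖s − ŝ‖²), on a toy signal. This is an assumption made for illustration only: the record does not say which SDR variant or toolkit the authors used, and the function name `sdr_db` and the toy signals are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Energy-ratio SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Illustrative definition only; not necessarily the variant used in the paper.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal is silent; SDR is undefined")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy data (hypothetical): a clean sine wave and two estimates with
# alternating-sign errors of amplitude 0.10 and 0.05.
reference = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
baseline = [s + 0.10 * (-1) ** n for n, s in enumerate(reference)]
improved = [s + 0.05 * (-1) ** n for n, s in enumerate(reference)]

# Difference between the two SDR values, in dB.
gain = sdr_db(reference, improved) - sdr_db(reference, baseline)
print(f"SDR gain: {gain:.2f} dB")
```

In this toy case, halving the error amplitude raises the SDR by about 6 dB, since the error energy falls by a factor of four. On the same scale, a 4 dB gain corresponds to the distortion energy dropping to roughly 40% of its previous value.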