
Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification

Multimodal deep learning, in the context of biometrics, encounters significant challenges due to the dependence on long speech utterances and RGB images, which are often impractical in certain situations. This paper presents a novel solution addressing these issues by leveraging ultrashort voice utterances and depth videos of the lip for person identification. The proposed method utilizes an amalgamation of residual neural networks to encode depth videos and a Time Delay Neural Network architecture to encode voice signals. In an effort to fuse information from these different modalities, we integrate self-attention and engineer a noise-resistant model that effectively manages diverse types of noise. Through rigorous testing on a benchmark dataset, our approach exhibits superior performance over existing methods, resulting in an average improvement of [Formula: see text]. This method is notably efficient for scenarios where extended utterances and RGB images are unfeasible or unattainable. Furthermore, its potential extends to various multimodal applications beyond just person identification.
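The fusion step described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): the voice embedding (e.g., from a Time Delay Neural Network) and the depth-video embedding (e.g., from a residual network) are stacked as tokens and passed through scaled dot-product self-attention before pooling. The embedding dimension (128) and the identity Q/K/V projections are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fusion(audio_emb, video_emb):
    """Fuse two modality embeddings by treating them as a 2-token
    sequence and applying one self-attention layer (identity
    projections for simplicity), then mean-pooling the result."""
    tokens = np.stack([audio_emb, video_emb])   # shape (2, d)
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)     # (2, 2) attention scores
    weights = softmax(scores, axis=-1)          # rows sum to 1
    attended = weights @ tokens                 # (2, d) attended tokens
    return attended.mean(axis=0)                # (d,) fused embedding

rng = np.random.default_rng(0)
audio = rng.standard_normal(128)   # stand-in for a TDNN voice embedding
video = rng.standard_normal(128)   # stand-in for a ResNet depth-video embedding
fused = self_attention_fusion(audio, video)
print(fused.shape)  # (128,)
```

In practice the projections would be learned and the attention weights would let the model down-weight whichever modality is noisier, which is the intuition behind the noise-resistant behavior the abstract claims.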


Bibliographic Details
Main Authors: Moufidi, Abderrazzaq; Rousseau, David; Rasti, Pejman
Format: Online Article Text
Language: English
Published: MDPI, 2023
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10346165/
https://www.ncbi.nlm.nih.gov/pubmed/37447739
http://dx.doi.org/10.3390/s23135890
Journal: Sensors (Basel)
Published online: 2023-06-25
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).